Using bootstrap to increase data in predictive analytics with extreme value distribution

ABSTRACT The bootstrap is a major tool for studying and evaluating the values of parameters in probability distributions, and this article uses it together with an overview of the theory of limit distribution functions. The tools used to address the problems raised in the paper are the mathematical methods of stochastic analysis, the theory of random processes, and multivariate statistics. Observations (realisations of a stationary process) are not independent, but dependence in time series is a relatively simple example of dependent data. Through a simulation study we found that the pseudo data generated by the bootstrap method always showed a weaker dependence among the observations than the time series they were sampled from; hence we conclude that even by resampling blocks instead of single observations we lose some of the structure of the original sample. A potential difficulty with using likelihood methods for the GEV concerns the regularity conditions required for the usual asymptotic properties of the maximum likelihood estimator to be valid. To estimate a parameter of the GEV we can use classical methods of mathematical statistics, such as the maximum likelihood method or the least squares method, but these all require a certain number of samples for verification. The bootstrap method does not; here we use the limit theorems of probability theory and multivariate statistics to solve the problem even when only one data sample is available. That is the practical significance our paper wants to convey. In predictive analysis problems where the actual data are incomplete or not long enough, we can use the bootstrap to add data.


Block bootstrap methods for time series
Observations (realisations of a stationary process) are not independent; dependence in time series is a relatively simple example of dependent data. Block bootstrap methods for time series data have been put forward by Hall 1, Kunsch 2, Singh 3, Politis and Romano 4, and Lahiri 5. Let X_1, X_2, ..., X_n be independent and identically distributed random variables with distribution function F. The basic step in extreme value theory is to investigate the distribution of M_n = max(X_1, X_2, ..., X_n) as n → ∞. Suppose there are sequences of constants a_n > 0 and b_n ∈ R such that

P((M_n − b_n)/a_n ≤ x) → G(x) as n → ∞, for x ∈ C(G),   (1)

where G(x) is a non-degenerate distribution function and C(G) is the set of all continuity points of G(x). The limit distribution functions G(x) satisfying equation (1) are the well known extreme value distributions of three types (Frechet, Weibull, and Gumbel distributions) 6,7. The generalized extreme value (GEV) family of distributions is

G(x) = exp{ −[1 + ξ(x − µ)/σ]^(−1/ξ) }, defined on {x : 1 + ξ(x − µ)/σ > 0},   (2)

where µ ∈ R is a location parameter, σ > 0 is a scale parameter, and ξ is the extreme value shape parameter.
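As a quick numerical check of the GEV family, the distribution function can be evaluated with scipy's genextreme; note that scipy's shape parameter c corresponds to −ξ in the notation above. This is a minimal sketch with illustrative parameter values, not values from the paper:

```python
import numpy as np
from scipy.stats import genextreme

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function G(x); scipy's shape c equals -xi."""
    return genextreme.cdf(x, c=-xi, loc=mu, scale=sigma)

# Cross-check against the closed form for xi != 0:
x0, mu0, sigma0, xi0 = 1.0, 0.0, 1.0, 0.2
manual = np.exp(-(1.0 + xi0 * (x0 - mu0) / sigma0) ** (-1.0 / xi0))

# Gumbel limit xi -> 0: G(x) = exp(-exp(-(x - mu)/sigma))
gumbel = gev_cdf(0.0, mu=0.0, sigma=1.0, xi=0.0)
```

The sign flip on c is the only subtlety; everything else matches equation (2) term by term.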

Non-overlapping Block Bootstrap (NBB)
Divide the series into non-overlapping blocks of length l, B_i = (X_{(i−1)l+1}, ..., X_{il}), i = 1, ..., m, where m = ⌊n/l⌋ (the largest integer less than or equal to n/l). To get a bootstrap sample we draw m blocks with replacement from {B_1, B_2, ..., B_m} and concatenate them; lm ≈ n, and every observation receives equal weight. The NBB uses fewer blocks than the MBB 10. Figure 1 shows the blocks of length l, sampled with replacement from {B_1, B_2, ..., B_m}.
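The NBB resampling step can be sketched as follows; the block length l and the seed are illustrative choices, not values from the paper:

```python
import numpy as np

def nbb_sample(x, l, rng=None):
    """Non-overlapping block bootstrap: split x into m = floor(n/l)
    blocks of length l and resample m blocks with replacement."""
    rng = np.random.default_rng(rng)
    n = len(x)
    m = n // l                                  # number of complete blocks
    blocks = [x[i * l:(i + 1) * l] for i in range(m)]
    idx = rng.integers(0, m, size=m)            # each block has equal weight
    return np.concatenate([blocks[i] for i in idx])

x = np.arange(12)
xb = nbb_sample(x, l=3, rng=0)                  # pseudo series of length 12
```

Because whole blocks are copied, short-range dependence within each block is preserved, but dependence across block boundaries is destroyed, which is exactly the loss of structure discussed above.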

Stationary block bootstrap (SB)
Blocks are no longer of equal size 11. The bootstrap sample is chosen according to a probability measure on block sequences. The stationary bootstrap method samples blocks of random lengths, the length of each block following a geometric distribution: P(L = k) = p(1 − p)^{k−1}, k = 1, 2, ..., so that the mean block length is 1/p.
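The stationary bootstrap can be sketched as below: at each step a new block is started with probability p (giving geometric block lengths with mean 1/p) and indices wrap around the series circularly. The value of p and the seed are illustrative:

```python
import numpy as np

def stationary_bootstrap(x, p, rng=None):
    """Stationary bootstrap (Politis & Romano): blocks of geometric
    random length with mean 1/p; indices wrap around the series."""
    rng = np.random.default_rng(rng)
    n = len(x)
    out = np.empty(n, dtype=x.dtype)
    t = rng.integers(n)                  # random starting point
    for i in range(n):
        out[i] = x[t]
        if rng.random() < p:             # start a new block w.p. p
            t = rng.integers(n)
        else:
            t = (t + 1) % n              # continue the block, wrapping
    return out

xb = stationary_bootstrap(np.arange(20), p=0.2, rng=1)
```

Unlike the moving block scheme, this construction yields a pseudo series that is itself stationary, which is the property the method is named for.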

Properties of Block Bootstrap Methods
Through a simulation study we found that pseudo data obtained from the bootstrap method always displayed a weaker dependence among the observations than the time series they were sampled from; hence we conclude that even by resampling blocks instead of single observations we lose some of the structure of the original sample. The pseudo time series produced by the moving block method is not stationary, even if the original series X_t is stationary. The pseudo time series produced by the stationary bootstrap method is itself a stationary time series. The mean X̄*_N of the moving block bootstrap is biased, in the sense that E*(X̄*_N) ≠ X̄_n, and the MBB estimator of the variance of √n X̄_n is also biased. This creates problems in using the percentile method with the MBB: the usual estimator should be modified so that the bootstrap distribution is centred at E*(X̄*_N) rather than at X̄_n. With this modification the bootstrap can improve substantially on the normal approximation.
Comparison of the block bootstrap methods: we find that overall the MBB and CBB methods give estimators with the smallest standard error and the SB method the largest (a random block length leads to a larger variance of the parameter estimates than the other methods, whose block length is fixed).

The block bootstrap procedure. Assume that the statistic θ̂ estimates a functional θ depending on the m-dimensional marginal distribution of the time series. First, build vectors of m consecutive observations, Y_t = (X_t, ..., X_{t+m−1}), t = 1, ..., n − m + 1. Then build overlapping blocks of l consecutive vectors, where l ∈ N is the block length parameter. For simplicity, first assume that n − m + 1 = kl with k ∈ N, so that we resample k blocks independently, where the starting points S_1, ..., S_k of the blocks are i.i.d. Uniform({m − 1, ..., n − l}). These resampled blocks of m-vectors can be referred to as the block bootstrap sample. On the other hand, as we will see, the bootstrapped block estimator is not simply defined by the plug-in rule, and the concept of the bootstrap sample is not as clear. If n − m + 1 is not a multiple of l, we resample k = [(n − m + 1)/l] + 1 blocks but use only a portion of the k-th block, to get n − m + 1 resampled vectors in total. Here the estimator has the form θ̂ = T(F̂_n), where F̂_n is an empirical distribution function of the m-dimensional marginal distribution of (X_t)_{t∈Z} and T is a smooth functional. The block bootstrapped estimator θ̂* is defined by applying T to the empirical distribution of the resampled m-vectors.

Science & Technology Development Journal -Engineering and Technology, 3(SI3):SI45-SI51
This definition of the block bootstrapped estimator can therefore be interpreted as employing a plug-in rule based on vectorized observations.

Choosing an optimal block length. The orders of magnitude of the optimal block size are known in some inference problems. According to those authors, three different settings of practical importance can be identified: estimation of a bias or variance, estimation of a one-sided distribution function, and estimation of a two-sided distribution function.
The optimal block length in the above situations is of order l ∼ C n^{1/k}, with k = 3, 4, 5 respectively, where n is the sample size. This result will be used here as the basis for choosing the optimal block length. Two main approaches can be pointed out: a cross-validation method proposed by Hall et al. 12 and a plug-in method based on recent work of Lahiri 5. Based on the research of Lahiri, a nonparametric plug-in (NPPI) method for selecting the optimal block length in order to reduce the bias will be considered. Unlike the traditional plug-in method, this method employs nonparametric resampling procedures to estimate the relevant constants in the leading term of the optimal block length. The variance of a block bootstrap estimator is an increasing function of the block length l, while its bias is a decreasing function of block length. As a result, for each block bootstrap estimator there is a critical value l_n^0 that minimizes the mean-square error (MSE). The value of l that minimizes the leading term in the expansion of the MSE is called the MSE-optimal block length. The block bootstrap can use either overlapping or non-overlapping blocks. With Φ denoting the standard normal distribution function (one-sided and symmetric versions) and ϕ the standard normal density function, asymptotic expansions of the MSE show that there are constants C_i, not depending on n or l, giving the leading terms of the bias and variance of the block bootstrap estimators; the symbol ∼ indicates that the quantity on the right-hand side is the leading term of an asymptotic expansion.

The terms involving C_2, C_4 and C_6 correspond to the variance; the variance terms are smaller if the blocks are overlapping than if they are non-overlapping. Following the expressions (3), (4), and (5), the asymptotically optimal block length (in the sense of minimizing the AMSE) is l = A_1 n^{1/3} for bias or variance estimation, l = A_2 n^{1/4} for estimating a one-sided distribution function, and l = A_3 n^{1/5} for estimating a symmetric distribution function (A_j > 0, j = 1, 2, 3, are suitable constants that depend on certain population parameters).
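The asymptotic orders above translate into a simple rule-of-thumb helper. The constant A is population-dependent; it is set to 1 here purely for illustration:

```python
def optimal_block_length(n, target="variance", A=1.0):
    """MSE-optimal block length order: l ~ A * n**(1/k), with k = 3
    for bias/variance estimation, k = 4 for a one-sided and k = 5
    for a two-sided (symmetric) distribution function.
    A is a population constant, set to 1.0 only for illustration."""
    k = {"variance": 3, "one-sided": 4, "two-sided": 5}[target]
    return max(1, int(round(A * n ** (1.0 / k))))

l_var = optimal_block_length(1000, "variance")    # ~ 1000**(1/3) = 10
l_sym = optimal_block_length(1000, "two-sided")   # ~ 1000**(1/5) = 4
```

In practice A_1, A_2, A_3 must be estimated, which is what the NPPI method described above does via nonparametric resampling.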

JACKKNIFE METHOD
We assume a vector of parameters θ. The bias of an estimator θ̂ of θ_0 is defined by Δ = E(θ̂) − θ_0. A large bias is often an undesirable aspect of an estimator's performance. We can use the bootstrap to estimate the bias of any estimator θ̂. We generate B independent bootstrap samples X*_1, X*_2, ..., X*_B, each consisting of n data values drawn with replacement from X; the number B can be chosen in the range 25–1000. Then we evaluate the bootstrap replication of the statistic for each bootstrap sample, θ̂*(b) = S(X*_b), b = 1, 2, ..., B. The bootstrap estimate of bias is defined by

Δ̂_B = θ̂*(·) − θ̂, where θ̂*(·) = (1/B) Σ_{b=1}^B θ̂*(b).
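The bootstrap bias estimate can be sketched as below. The choices of B, the seed, and the test statistic are illustrative; the divisor-n variance is used because its true bias, −σ²/n, is known in closed form:

```python
import numpy as np

def bootstrap_bias(x, stat, B=500, rng=None):
    """Bootstrap bias estimate: mean of B bootstrap replications of
    the statistic minus the statistic on the original sample."""
    rng = np.random.default_rng(rng)
    n = len(x)
    reps = np.array([stat(x[rng.integers(0, n, n)]) for _ in range(B)])
    return reps.mean() - stat(x)

# The divisor-n variance has true bias -sigma^2/n (about -0.01 here).
x = np.random.default_rng(0).normal(size=100)
b = bootstrap_bias(x, lambda s: s.var(), B=500, rng=1)
```

For B in the 25–1000 range mentioned above, the Monte Carlo noise in Δ̂_B shrinks like 1/√B, so larger B gives a steadier bias estimate.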

Science & Technology Development Journal -Engineering and Technology, 3(SI3):SI45-SI51
We have focused on the standard error as a measure of accuracy for an estimator θ̂. The standard error se_B(θ̂) is estimated by the sample standard deviation of the B replications,

se_B = { (1/(B − 1)) Σ_{b=1}^B [θ̂*(b) − θ̂*(·)]² }^{1/2}.

The jackknife estimate of bias is an alternative way to find the bias; it was the original computer-based method for estimating biases and standard errors. If we have a sample x = (x_1, x_2, ..., x_n) and an estimator θ̂ = S(x), the i-th jackknife sample x_(i) is defined to be x with the i-th data point removed. For i = 1, 2, ..., n, the jackknife estimate of bias is defined by

Δ̂_jack = (n − 1)(θ̂_(·) − θ̂), where θ̂_(·) = (1/n) Σ_{i=1}^n S(x_(i)).

The jackknife usually provides a good and simple approximation to the bootstrap for estimating standard errors and biases.
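The jackknife bias estimate is a few lines of code. The divisor-n variance is again used as the test statistic because, for it, the jackknife correction is known to recover the unbiased (divisor n−1) variance exactly:

```python
import numpy as np

def jackknife_bias(x, stat):
    """Jackknife bias estimate: (n - 1) times (mean of the n
    leave-one-out replications minus the full-sample statistic)."""
    n = len(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    return (n - 1) * (loo.mean() - stat(x))

# For the divisor-n variance, theta_hat - bias equals the unbiased
# divisor-(n-1) variance exactly.
x = np.arange(10, dtype=float)
b = jackknife_bias(x, lambda s: s.var())
corrected = x.var() - b            # equals x.var(ddof=1)
```

Unlike the bootstrap version, this estimate is deterministic: it needs exactly n leave-one-out evaluations and no random resampling.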
As a rule of thumb, a bias of less than 0.25 standard errors can be ignored, unless we are trying to do careful confidence interval calculations. The root mean square error of an estimator θ̂ for θ is a measure of accuracy that takes into account both bias and standard error. It can be shown that

RMSE = √(E(θ̂ − θ)²) = se · √(1 + (Δ/se)²).

If Δ = 0 then the root mean square error equals its minimum value, the standard error. Conversely, if Δ/se = 0.25, the root mean square error is no more than about 3.1% greater than the standard error. The obvious bias-corrected estimator is θ̂_corr = θ̂ − Δ̂. When the bias is small compared to the estimated standard error se, it is safer to use θ̂ than θ̂_corr. Conversely, if the bias is large compared to the standard error, it may be an indication that the statistic θ̂ = S(X) is not an appropriate estimator of the parameter θ. Quantifying the accuracy of an estimator can often be made clearer by calculating confidence intervals. A standard result states that the maximum likelihood estimator θ̂ has a limiting multivariate normal distribution with mean θ_0 and variance-covariance matrix given by the inverse of Fisher's information matrix. Because the true value θ_0 is generally unknown, it is usual to approximate the terms of I with those of the observed information matrix I_O(θ), evaluated at θ = θ̂. Denoting an arbitrary term of the inverse of I_O(θ̂) by σ_{i,j}, it follows that an approximate (1 − τ) confidence interval, 0 < τ < 1, can be constructed for each component. Let θ̂ be the maximum likelihood estimator of the k-dimensional parameter θ_0, with approximate variance-covariance matrix H(θ̂).
Moreover, a confidence interval can be derived from the likelihood function, using the approximation of the deviance by a χ² distribution. It follows that an approximate (1 − τ) confidence region for θ_0 is given by the set of θ whose deviance does not exceed the (1 − τ) quantile of the χ² distribution. This approximation is usually more accurate than that based on the asymptotic normality of the maximum likelihood estimator.
The profile log likelihood for a component of θ can be formally written by partitioning θ into two components and maximizing the log likelihood over the nuisance component. Under suitable regularity conditions, for large n, the resulting deviance is approximately distributed as χ²_1, so confidence intervals follow from the quantiles of the χ²_1 distribution. Another method of model selection is the Akaike Information Criterion (AIC). The AIC has played a significant role in solving problems in a wide variety of fields as a model selection criterion for analyzing actual data. The AIC is defined by AIC = −2 (maximum log likelihood) + 2 (number of free parameters).

The number of free parameters in a model refers to the dimension of the parameter vector θ contained in the specified model f(x|θ).
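The AIC definition above is a one-liner; the log-likelihood values below are made-up numbers used purely to illustrate the comparison (a lower AIC is preferred):

```python
def aic(max_log_likelihood, n_params):
    """AIC = -2 * (maximum log likelihood) + 2 * (free parameters)."""
    return -2.0 * max_log_likelihood + 2.0 * n_params

# Hypothetical model comparison: full GEV (mu, sigma, xi) vs the
# Gumbel submodel (mu, sigma); the likelihood values are illustrative.
a_gev = aic(-120.3, 3)
a_gumbel = aic(-123.9, 2)
best = min(a_gev, a_gumbel)
```

The 2·(number of parameters) term penalizes the extra shape parameter, so the full GEV is preferred only when it improves the maximized log likelihood by more than one unit over the Gumbel submodel.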

ANALYSIS FOR GEV DISTRIBUTIONS
An implicit difficulty with the use of likelihood methods for the GEV concerns the regularity conditions required for the usual asymptotic properties of the maximum likelihood estimator to be valid. Those conditions are not satisfied by the GEV model, because the end points of the GEV distribution are functions of the parameter values: µ − σ/ξ is an upper end-point of the distribution when ξ < 0, and a lower end-point when ξ > 0. This violation of the usual regularity conditions means that the standard asymptotic likelihood results are not automatically applicable 13,14. This problem has been studied in detail, with the following results: (i) if ξ > −0.5, maximum likelihood estimators are regular, in the sense of having the usual asymptotic properties; (ii) when −1 < ξ < −0.5, maximum likelihood estimators are generally obtainable, but do not have the standard asymptotic properties; and (iii) when ξ < −1, maximum likelihood estimators are unlikely to be obtainable. Under the assumption that X_1, X_2, ..., X_m are independent random variables having the GEV distribution, the log likelihood for the GEV parameters when ξ ≠ 0 is

l(µ, σ, ξ) = −m log σ − (1 + 1/ξ) Σ_{i=1}^m log[1 + ξ(x_i − µ)/σ] − Σ_{i=1}^m [1 + ξ(x_i − µ)/σ]^{−1/ξ},

provided that 1 + ξ(x_i − µ)/σ > 0 for i = 1, 2, ..., m. The case ξ → 0 requires separate treatment using the Gumbel limit of the GEV distribution. This leads to the log likelihood

l(µ, σ) = −m log σ − Σ_{i=1}^m (x_i − µ)/σ − Σ_{i=1}^m exp{−(x_i − µ)/σ}.

There is no analytical solution, but for any given dataset the maximization is straightforward using standard numerical optimization algorithms.
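The numerical maximization mentioned above is available off the shelf. A sketch using scipy's genextreme, with simulated block maxima and illustrative true parameter values (recall scipy's shape c = −ξ):

```python
import numpy as np
from scipy.stats import genextreme

# Simulate 500 block maxima from a GEV with xi = 0.1 (scipy c = -0.1),
# mu = 10, sigma = 2, then refit all three parameters by numerical
# maximum likelihood (genextreme.fit uses numerical optimization).
data = genextreme.rvs(c=-0.1, loc=10.0, scale=2.0, size=500,
                      random_state=42)
c_hat, mu_hat, sigma_hat = genextreme.fit(data)
xi_hat = -c_hat
```

With ξ = 0.1 > −0.5 this is the regular case (i) above, so the estimates enjoy the usual asymptotic properties and should land close to the simulated truth.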
Estimates of extreme quantiles of the maximum distribution under linear normalization are obtained by inverting the GEV distribution function: the return level x_p satisfies G(x_p) = 1 − p, so it is exceeded by the annual maximum in any particular year with probability p.

By substituting the maximum likelihood estimates of the GEV parameters, the maximum likelihood estimate of x_p for 0 < p < 1 is obtained as

x̂_p = µ̂ − (σ̂/ξ̂)[1 − {−log(1 − p)}^{−ξ̂}] for ξ̂ ≠ 0, and x̂_p = µ̂ − σ̂ log{−log(1 − p)} for ξ̂ = 0,

so that if x_p is plotted against y_p = −log(1 − p) on a logarithmic scale, the plot is linear in the case ξ = 0. By the delta method, Var(x̂_p) ≈ ∇x_p^T V ∇x_p, where V is the variance-covariance matrix of (µ̂, σ̂, ξ̂) and ∇x_p is the gradient of x_p evaluated at the estimates; when ξ̂ → 0 this remains valid with the Gumbel expression and the corresponding parameters (µ̂, σ̂). Profile likelihood. Numerical evaluation of the profile likelihood for any of the individual parameters µ, σ or ξ is straightforward. For example, to obtain the profile likelihood for ξ, we fix ξ = ξ_0 and maximize the log likelihood with respect to the remaining parameters µ and σ; this is repeated for a range of values of ξ_0. The methodology can also be applied when inference is required for some combination of parameters. In particular, we can obtain the profile likelihood for any specified return level x_p. This requires a re-parameterization of the GEV model so that x_p is one of the model parameters, after which the profile log likelihood is obtained by maximization with respect to the remaining parameters in the usual way. Re-parameterization is straightforward: substituting

µ = x_p + (σ/ξ)[1 − {−log(1 − p)}^{−ξ}]

into the log likelihood has the desired effect of expressing the GEV model in terms of the parameters (x_p, σ, ξ).
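The return level calculation can be sketched directly from the inversion of the GEV distribution function; the parameter values here are illustrative, not estimates from the paper's data:

```python
import numpy as np

def return_level(p, mu, sigma, xi):
    """Return level x_p with G(x_p) = 1 - p, from inverting the GEV
    distribution function; xi near zero falls back to the Gumbel case."""
    z = -np.log(1.0 - p)                 # z = -log(1 - p)
    if abs(xi) < 1e-9:                   # Gumbel limit xi -> 0
        return mu - sigma * np.log(z)
    return mu - (sigma / xi) * (1.0 - z ** (-xi))

# 100-block return level (p = 0.01) for illustrative parameters
x100 = return_level(0.01, mu=10.0, sigma=2.0, xi=0.1)
```

A positive ξ thickens the upper tail, so x̂_p grows without bound as p → 0; with ξ < 0 the return levels would instead approach the finite upper end-point µ − σ/ξ.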

Model validity.
A probability plot is a comparison of the empirical and fitted distribution functions. With ordered block maximum data x_(1) ≤ x_(2) ≤ ... ≤ x_(m), the empirical distribution function evaluated at x_(i) is given by G̃(x_(i)) = i/(m + 1).

By substitution of the parameter estimates into (2), the corresponding model-based estimates are Ĝ(x_(i)). We then construct a plot consisting of the points

{(G̃(x_(i)), Ĝ(x_(i))), i = 1, 2, ..., m}.

If the fitted model is adequate, these points should lie close to the unit diagonal. However, both G̃(x_(i)) and Ĝ(x_(i)) are bound to approach 1 as x_(i) increases, while it is usually the accuracy of the model for large values of x that is of greatest concern. That is, the probability plot provides the least information in the region of most interest. This deficiency is avoided by the quantile plot, consisting of the points

{(Ĝ^{−1}(i/(m + 1)), x_(i)), i = 1, 2, ..., m}.

If Ĝ is a reasonable estimate of G, then the quantile plot should also consist of points close to the unit diagonal.
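The coordinates of both diagnostic plots can be computed in a few lines with scipy's genextreme (shape c = −ξ); the simulated data and parameter values are illustrative:

```python
import numpy as np
from scipy.stats import genextreme

def diagnostic_points(data, mu, sigma, xi):
    """Probability- and quantile-plot coordinates for a fitted GEV:
    empirical df i/(m+1) at the ordered maxima vs the model df, and
    model quantiles G^{-1}(i/(m+1)) vs the ordered maxima."""
    xs = np.sort(data)
    m = len(xs)
    emp = np.arange(1, m + 1) / (m + 1)                 # G-tilde = i/(m+1)
    model_df = genextreme.cdf(xs, c=-xi, loc=mu, scale=sigma)
    model_q = genextreme.ppf(emp, c=-xi, loc=mu, scale=sigma)
    return emp, model_df, model_q, xs

# Illustrative: simulate maxima and evaluate at the true parameters,
# so both plots should hug the unit diagonal.
data = genextreme.rvs(c=-0.1, loc=10.0, scale=2.0, size=200,
                      random_state=0)
emp, mdf, mq, xs = diagnostic_points(data, 10.0, 2.0, 0.1)
```

Plotting (emp, mdf) gives the probability plot and (mq, xs) the quantile plot; the latter stretches out the upper tail, which is why it is the more informative diagnostic for extremes.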

CONCLUSION
To estimate the value of a parameter in the GEV we can use classical methods of mathematical statistics, such as the maximum likelihood method or the least squares method, but these all require a certain number of samples for verification. For the bootstrap method this is not needed; here we use the limit theorems of probability theory and multivariate statistics to solve the problem even when only one data sample is available. That is the practical significance our paper wants to convey. We used the bootstrap method to process statistical hydrological data, together with stochastic calculations and the R software for data analysis.
In the research of Cuong et al. 15 regarding water, salinity and flood peaks of the Mekong Delta, we forecasted for the period up to 2020 based on data from 1976 to 2016. But the data are still not long enough for more accurate forecasts; therefore we will use the bootstrap to increase the data for predictive analytics problems.