Surrogate data, sometimes known as analogous data, usually refers to time series data that is produced using well-defined (linear) models, such as ARMA processes, that reproduce various statistical properties of a measured data set, such as its autocorrelation structure.
The resulting surrogate data can then be used, for example, to test for non-linear structure in the empirical data; this is called surrogate data testing.
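As an illustration of the idea above, the following minimal sketch generates a surrogate from a first-order autoregressive model, AR(1), the simplest member of the ARMA family. The function name `ar1_surrogate` and the choice of AR(1) are assumptions made for this example; in practice a higher-order model may be needed to reproduce the full autocorrelation structure.

```python
import numpy as np

def ar1_surrogate(x, rng=None):
    """Return an AR(1) surrogate matching the mean, variance,
    and lag-1 autocorrelation of x (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    centered = x - mean
    # Lag-1 autocorrelation estimate, used as the AR(1) coefficient.
    phi = np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered)
    # Innovation scale chosen so the stationary variance matches that of x.
    sigma = np.sqrt(centered.var() * (1.0 - phi**2))
    s = np.empty(len(x))
    s[0] = rng.normal(0.0, centered.std())  # start in the stationary distribution
    for t in range(1, len(x)):
        s[t] = phi * s[t - 1] + rng.normal(0.0, sigma)
    return s + mean
```

The surrogate shares the linear properties of the measured series but, by construction, contains no non-linear structure.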
Surrogate or analogous data also refers to data used to supplement the available data from which a mathematical model is built. Under this definition, it may be generated (i.e., synthetic data) or transformed from another source.
Uses
Surrogate data is used in environmental and laboratory settings when study data from one source is used to estimate the characteristics of another source. For example, it has been used to model population trends in animal species.
It can also be used to model biodiversity, as it would be difficult to gather actual data on all species in a given area.
Surrogate data may be used in forecasting. Data from similar series may be pooled to improve forecast accuracy. Use of surrogate data may enable a model to account for patterns not seen in historical data.
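One way pooling might look in practice is sketched below: a single AR(1) coefficient is estimated jointly from the target series and a set of similar (analog) series, then used for a one-step forecast of the target. The function name `pooled_ar1_forecast` and the AR(1) model are assumptions chosen for brevity, not a method prescribed by the forecasting literature.

```python
import numpy as np

def pooled_ar1_forecast(target, analogs):
    """One-step forecast of `target` using an AR(1) coefficient
    pooled over the target and analog series (illustrative sketch)."""
    num = 0.0
    den = 0.0
    for series in [target, *analogs]:
        c = np.asarray(series, dtype=float)
        c = c - c.mean()                 # centre each series on its own mean
        num += np.dot(c[:-1], c[1:])     # lag-1 cross products
        den += np.dot(c[:-1], c[:-1])    # lag-0 products
    phi = num / den                      # pooled least-squares AR(1) slope
    target = np.asarray(target, dtype=float)
    mu = target.mean()
    return mu + phi * (target[-1] - mu)
```

Pooling is most useful when the target series is short, since the analog series then contribute most of the information about the shared dynamics.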
Another use of surrogate data is to test models for non-linearity. The term surrogate data testing refers to algorithms used to analyze models in this way.
These tests typically involve generating data, whereas surrogate data in general can be produced or gathered in many ways.
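A surrogate data test can be sketched as follows. This example assumes phase-randomized (Fourier transform) surrogates, which keep the power spectrum of the series while randomizing its Fourier phases, and a time-reversal asymmetry statistic. Both choices, and all function names, are assumptions for illustration; other surrogate schemes and statistics are in common use.

```python
import numpy as np

def phase_randomized(x, rng):
    """Surrogate with the same power spectrum as x and random phases."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, len(spectrum))
    phases[0] = 0.0          # leave the zero-frequency (mean) term unchanged
    if n % 2 == 0:
        phases[-1] = 0.0     # the Nyquist term must stay real for even n
    return np.fft.irfft(spectrum * np.exp(1j * phases), n=n)

def time_reversal_asymmetry(x):
    """Third moment of the increments; near zero for linear Gaussian series."""
    return np.mean((x[1:] - x[:-1]) ** 3)

def surrogate_test(x, statistic, n_surrogates=99, seed=None):
    """Return the observed statistic and a two-sided rank-based p-value."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = statistic(x)
    null = np.array([statistic(phase_randomized(x, rng))
                     for _ in range(n_surrogates)])
    centre = null.mean()
    # Count surrogates at least as far from the null centre as the data.
    extreme = np.sum(np.abs(null - centre) >= np.abs(observed - centre))
    return observed, (1 + extreme) / (n_surrogates + 1)
```

A small p-value indicates that the statistic of the measured series is unlikely under the null hypothesis of a linear stochastic process, which is the proof-by-contradiction logic of the test.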
Methods
One method of obtaining surrogate data is to find a source with similar conditions or parameters and to use its data in modeling.
Another method is to focus on patterns of the underlying system, and to search for a similar pattern in related data sources (for example, patterns in other related species or environmental areas).
Rather than using existing data from a separate source, surrogate data may be generated through statistical processes, which may involve random data generation using constraints of the model or system.
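A simple instance of constrained random generation is the random-shuffle surrogate, sketched below. The constraint assumed here is that each surrogate contains exactly the same values as the original series, so its empirical distribution is preserved, while the temporal order is randomized. The function name `shuffled_surrogates` is hypothetical.

```python
import numpy as np

def shuffled_surrogates(x, n_surrogates, seed=None):
    """Return an array of random permutations of x, one per row.

    Each row keeps the exact values of x (the constraint) but
    destroys any temporal structure (illustrative sketch).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    return np.array([rng.permutation(x) for _ in range(n_surrogates)])
```

Such surrogates correspond to the hypothesis that the observations are independent and identically distributed; stricter constraints, such as preserving the autocorrelation, require other generation schemes.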
See also
* Bootstrapping (statistics)
* Data augmentation
* Jackknife resampling
Further reading
* Schreiber, T.; Schmitz, A. (1996). "Improved Surrogate Data for Nonlinearity Tests". Physical Review Letters. 77 (4): 635–638. arXiv:chao-dyn/9909041. Bibcode:1996PhRvL..77..635S. doi:10.1103/PhysRevLett.77.635. PMID 10062864. S2CID 13193081.