In the analysis of data, a correlogram is a chart of correlation statistics. For example, in time series analysis, a plot of the sample autocorrelations ''r_h'' versus ''h'' (the time lags) is an autocorrelogram. If cross-correlation is plotted, the result is called a cross-correlogram.

The correlogram is a commonly used tool for checking randomness in a data set. If the data are random, the autocorrelations should be near zero for any and all time-lag separations. If they are non-random, then one or more of the autocorrelations will be significantly non-zero. If the analyst does not check for randomness, then the validity of many of the statistical conclusions becomes suspect.

In addition, correlograms are used in the model identification stage for Box–Jenkins autoregressive moving average time series models.

In multivariate analysis, correlation matrices shown as color-mapped images may also be called "correlograms" or "corrgrams".

Applications

The correlogram can help provide answers to the following questions:
* Are the data random?
* Is an observation related to an adjacent observation?
* Is an observation related to an observation twice-removed? (etc.)
* Is the observed time series white noise?
* Is the observed time series sinusoidal?
* Is the observed time series autoregressive?
* What is an appropriate model for the observed time series?
* Is the model
:: Y = \text{constant} + \text{error}
: valid and sufficient?
* Is the formula
:: s_{\bar{Y}} = s/\sqrt{N}
: valid?

Importance

Randomness (along with fixed model, fixed variation, and fixed distribution) is one of the four assumptions that typically underlie all measurement processes. The randomness assumption is critically important for the following three reasons:
* Most standard statistical tests depend on randomness. The validity of the test conclusions is directly linked to the validity of the randomness assumption.
* Many commonly used statistical formulae depend on the randomness assumption, the most common formula being the formula for determining the standard error of the sample mean:
:: s_{\bar{Y}} = s/\sqrt{N}
: where ''s'' is the standard deviation of the data and ''N'' is the sample size. Although heavily used, the results from using this formula are of no value unless the randomness assumption holds.
* For univariate data, the default model is
:: Y = \text{constant} + \text{error}
: If the data are not random, this model is incorrect and invalid, and the estimates for the parameters (such as the constant) become nonsensical and invalid.
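As a small illustration of the standard error formula above, the following sketch computes ''s''/√''N'' for a short sample. It assumes that ''s'' is the sample standard deviation (divisor ''N'' − 1), which the text does not specify; the function name and the data are hypothetical. The result is only meaningful when the randomness assumption holds.

```python
from math import sqrt
from statistics import stdev


def standard_error_of_mean(data):
    """Return s / sqrt(N), with s taken as the sample standard deviation.

    Assumption: s uses the N - 1 divisor (statistics.stdev).
    """
    return stdev(data) / sqrt(len(data))


# Hypothetical sample, used only to exercise the formula.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
se = standard_error_of_mean(sample)
```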

Estimation of autocorrelations

The autocorrelation coefficient at lag ''h'' is given by
: r_h = c_h/c_0 \,

where ''c_h'' is the autocovariance function
: c_h = \frac{1}{N} \sum_{t=1}^{N-h} \left(Y_t - \bar{Y}\right)\left(Y_{t+h} - \bar{Y}\right)

and ''c''₀ is the variance function
: c_0 = \frac{1}{N} \sum_{t=1}^{N} \left(Y_t - \bar{Y}\right)^2

The resulting value of ''r_h'' will range between −1 and +1.
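
The estimator above can be sketched in a few lines of plain Python. This is a minimal illustration and not a reference implementation; the function names and the example series are made up for the purpose, and the series is assumed to be a list of numbers with non-zero variance.

```python
def autocovariance(y, h):
    """Sample autocovariance c_h with the 1/N normalisation."""
    n = len(y)
    mean = sum(y) / n
    # Sum over t = 1 .. N-h (0-based: t = 0 .. n-h-1), then divide by N.
    return sum((y[t] - mean) * (y[t + h] - mean) for t in range(n - h)) / n


def autocorrelation(y, h):
    """Sample autocorrelation r_h = c_h / c_0."""
    return autocovariance(y, h) / autocovariance(y, 0)


# Hypothetical series; r[h] holds r_h for h = 0 .. N-1.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
r = [autocorrelation(series, h) for h in range(len(series))]
```

Plotting the values in `r` against the lag index gives the autocorrelogram; `r[0]` is always 1 by construction.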

Alternate estimate

Some sources may use the following formula for the autocovariance function:
: c_h = \frac{1}{N-h} \sum_{t=1}^{N-h} \left(Y_t - \bar{Y}\right)\left(Y_{t+h} - \bar{Y}\right)

Although this definition has less bias, the (1/''N'') formulation has some desirable statistical properties and is the form most commonly used in the statistics literature. See pages 20 and 49–50 in Chatfield for details.

In contrast to the (1/''N'') definition, the 1/(''N'' − ''h'') definition allows ''c_h'' to be computed in a slightly more intuitive way. Consider the sample ''Y''₁, …, ''Y_N'', where each ''Y_i'' is a vector in ℝ^n for ''i'' = 1, …, ''N''. Then, let
: X = \begin{bmatrix} Y_1 - \bar{Y} & \cdots & Y_N - \bar{Y} \end{bmatrix} \in \mathbb{R}^{n \times N}

We then compute the Gram matrix
: Q = X^\top X

Finally, ''c_h'' is computed as the sample mean of the ''h''th diagonal of ''Q''. For example, the 0th diagonal (the main diagonal) of ''Q'' has ''N'' elements, and its sample mean corresponds to ''c''₀. The 1st diagonal (to the right of the main diagonal) of ''Q'' has ''N'' − 1 elements, and its sample mean corresponds to ''c''₁, and so on.
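
The Gram-matrix view can be checked numerically. The sketch below assumes scalar observations (''n'' = 1), so that ''X'' is a single row and each entry of ''Q'' is a product of two centred values; the function names and data are hypothetical. It compares the mean of the ''h''th diagonal with the direct 1/(''N'' − ''h'') sum.

```python
def autocovariance_direct(y, h):
    """c_h from the 1/(N - h) formula."""
    n = len(y)
    mean = sum(y) / n
    return sum((y[t] - mean) * (y[t + h] - mean) for t in range(n - h)) / (n - h)


def autocovariance_from_gram(y, h):
    """c_h as the mean of the h-th diagonal of Q = X^T X (scalar data only)."""
    n = len(y)
    mean = sum(y) / n
    x = [v - mean for v in y]                 # centred sample, X as a 1 x N row
    q = [[xi * xj for xj in x] for xi in x]   # Gram matrix, N x N
    diagonal = [q[t][t + h] for t in range(n - h)]  # h-th diagonal right of the main one
    return sum(diagonal) / len(diagonal)


# Hypothetical data, used only to compare the two computations.
data = [2.0, 4.0, 3.0, 7.0, 4.0]
pairs = [(autocovariance_direct(data, h), autocovariance_from_gram(data, h))
         for h in range(len(data))]
```

Building the full ''N'' × ''N'' matrix is wasteful for long series; it is done here only to mirror the description.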

Statistical inference with correlograms

In the same graph one can draw upper and lower bounds for autocorrelation with significance level ''α'':
: B = \pm z_{1-\alpha/2} \, SE(r_h)

with ''r_h'' as the estimated autocorrelation at lag ''h''.

If the autocorrelation is higher (lower) than this upper (lower) bound, the null hypothesis that there is no autocorrelation at and beyond a given lag is rejected at a significance level of ''α''. This test is an approximate one and assumes that the time-series is Gaussian.

In the above, ''z''_1−''α''/2 is the quantile of the normal distribution; SE is the standard error, which can be computed by Bartlett's formula for MA(''ℓ'') processes:
: SE(r_1) = \frac{1}{\sqrt{N}}
: SE(r_h) = \sqrt{\frac{1 + 2\sum_{i=1}^{h-1} r_i^2}{N}} \quad \text{for } h > 1.
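
Bartlett's formula and the resulting bound can be sketched as follows. This is an illustration under stated assumptions: the autocorrelations are supplied as a list `r` with `r[0] == 1`, the values used are invented for the example, and the standard normal quantile is taken from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def bartlett_se(r, h, n):
    """Standard error of r_h; r[i] holds r_i and n is the sample size."""
    # For h = 1 the sum is empty, which gives 1 / sqrt(N).
    return sqrt((1 + 2 * sum(r[i] ** 2 for i in range(1, h))) / n)


def bound(r, h, n, alpha=0.05):
    """Upper bound B = z_{1 - alpha/2} * SE(r_h); the lower bound is -B."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * bartlett_se(r, h, n)


# Hypothetical autocorrelations for a sample of size 100.
r = [1.0, 0.5, 0.2, 0.1]
bounds = [bound(r, h, 100) for h in (1, 2, 3)]
```

Because each additional squared autocorrelation adds to the sum, the bounds can only stay level or widen as the lag grows.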

In the example plotted, we can reject the null hypothesis that there is no autocorrelation between time-points which are separated by lags up to 4. For most longer periods one cannot reject the null hypothesis of no autocorrelation.

Note that there are two distinct formulas for generating the confidence bands:

1. If the correlogram is being used to test for randomness (i.e., there is no time dependence in the data), the following formula is recommended:
: \pm \frac{z_{1-\alpha/2}}{\sqrt{N}}

where ''N'' is the sample size, ''z'' is the quantile function of the standard normal distribution and ''α'' is the significance level. In this case, the confidence bands have fixed width that depends on the sample size.

2. Correlograms are also used in the model identification stage for fitting ARIMA models. In this case, a moving average model is assumed for the data and the following confidence bands should be generated:
: \pm z_{1-\alpha/2} \sqrt{\frac{1}{N}\left(1 + 2\sum_{i=1}^{k-1} r_i^2\right)}

where ''k'' is the lag. In this case, the confidence bands increase as the lag increases.
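
The first, fixed-width band lends itself to a simple randomness check: compute the sample autocorrelations and list the lags that fall outside ±''z''/√''N''. The sketch below is illustrative only; the function names are hypothetical and the input is a deliberately non-random trending series.

```python
from math import sqrt
from statistics import NormalDist


def autocorrelation(y, h):
    """Sample autocorrelation r_h using the 1/N autocovariance."""
    n = len(y)
    mean = sum(y) / n
    c_h = sum((y[t] - mean) * (y[t + h] - mean) for t in range(n - h)) / n
    c_0 = sum((v - mean) ** 2 for v in y) / n
    return c_h / c_0


def randomness_band(n, alpha=0.05):
    """Half-width z_{1 - alpha/2} / sqrt(N) of the fixed-width band."""
    return NormalDist().inv_cdf(1 - alpha / 2) / sqrt(n)


def lags_outside_band(y, max_lag, alpha=0.05):
    """Lags 1 .. max_lag whose |r_h| exceeds the randomness band."""
    band = randomness_band(len(y), alpha)
    return [h for h in range(1, max_lag + 1) if abs(autocorrelation(y, h)) > band]


# A linear trend is clearly not random, so low lags should be flagged.
trend = [float(t) for t in range(1, 21)]
flagged = lags_outside_band(trend, max_lag=5)
```

For data that really are random, roughly a fraction ''α'' of the lags would still be expected to fall outside the band by chance, so a single flagged lag is weak evidence on its own.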

Software

Correlograms are available in most general purpose statistical libraries.

Correlograms:
* Python pandas: pandas.plotting.autocorrelation_plot
* R: functions acf and pacf

Corrgrams:
* Python seaborn: heatmap, pairplot
* R: corrgram

Related techniques

* Partial autocorrelation function
* Lag plot
* Spectral plot
* Seasonal subseries plot
* Scaled Correlation
* Variogram
