Q–Q plot
   HOME

TheInfoList



OR:

In statistics, a Q–Q plot (quantile-quantile plot) is a probability plot, a graphical method for comparing two
probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon ...
s by plotting their ''
quantile In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile th ...
s'' against each other. A point on the plot corresponds to one of the quantiles of the second distribution (-coordinate) plotted against the same quantile of the first distribution (-coordinate). This defines a parametric curve where the parameter is the index of the quantile interval. If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the identity line . If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line . Q–Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions. A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as
location In geography, location or place are used to denote a region (point, line, or area) on Earth's surface or elsewhere. The term ''location'' generally implies a higher degree of certainty than ''place'', the latter often indicating an entity with an ...
, scale, and
skewness In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal ...
are similar or different in the two distributions. Q–Q plots can be used to compare collections of data, or theoretical distributions. The use of Q–Q plots to compare two samples of data can be viewed as a
non-parametric Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based on either being distri ...
approach to comparing their underlying distributions. A Q–Q plot is generally more diagnostic than comparing the samples'
histogram A histogram is an approximate representation of the frequency distribution, distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to "Data binning, bin" (or "Data binning, buck ...
s, but is less widely known. Q–Q plots are commonly used to compare a data set to a theoretical model. This can provide an assessment of
goodness of fit The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measure ...
that is graphical, rather than reducing to a numerical
summary statistic In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible. Statisticians commonly try to describe the observations in * a measure of ...
. Q–Q plots are also used to compare two theoretical distributions to each other. Since Q–Q plots compare distributions, there is no need for the values to be observed as pairs, as in a
scatter plot A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data ...
, or even for the numbers of values in the two groups being compared to be equal. The term "probability plot" sometimes refers specifically to a Q–Q plot, sometimes to a more general class of plots, and sometimes to the less commonly used
P–P plot In statistics, a P–P plot (probability–probability plot or percent–percent plot or P value plot) is a probability plot for assessing how closely two data sets agree, or for assessing how closely a dataset fits a particular model. It works b ...
. The probability plot correlation coefficient plot (PPCC plot) is a quantity derived from the idea of Q–Q plots, which measures the agreement of a fitted distribution with observed data and which is sometimes used as a means of fitting a distribution to data.


Definition and construction

A Q–Q plot is a plot of the quantiles of two distributions against each other, or a plot based on estimates of the quantiles. The pattern of points in the plot is used to compare the two distributions. The main step in constructing a Q–Q plot is calculating or estimating the quantiles to be plotted. If one or both of the axes in a Q–Q plot is based on a theoretical distribution with a continuous
cumulative distribution function In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x. Eve ...
(CDF), all quantiles are uniquely defined and can be obtained by inverting the CDF. If a theoretical probability distribution with a discontinuous CDF is one of the two distributions being compared, some of the quantiles may not be defined, so an interpolated quantile may be plotted. If the Q–Q plot is based on data, there are multiple quantile estimators in use. Rules for forming Q–Q plots when quantiles must be estimated or interpolated are called plotting positions. A simple case is where one has two data sets of the same size. In that case, to make the Q–Q plot, one orders each set in increasing order, then pairs off and plots the corresponding values. A more complicated construction is the case where two data sets of different sizes are being compared. To construct the Q–Q plot in this case, it is necessary to use an
interpolated In the mathematical field of numerical analysis, interpolation is a type of estimation, a method of constructing (finding) new data points based on the range of a discrete set of known data points. In engineering and science, one often has a n ...
quantile estimate so that quantiles corresponding to the same underlying probability can be constructed. More abstractly, given two cumulative probability distribution functions and , with associated
quantile function In probability and statistics, the quantile function, associated with a probability distribution of a random variable, specifies the value of the random variable such that the probability of the variable being less than or equal to that value equ ...
s and (the inverse function of the CDF is the quantile function), the Q–Q plot draws the -th quantile of against the -th quantile of for a range of values of . Thus, the Q–Q plot is a
parametric curve In mathematics, a parametric equation defines a group of quantities as functions of one or more independent variables called parameters. Parametric equations are commonly used to express the coordinates of the points that make up a geometric obj ...
indexed over ,1with values in the real plane .


Interpretation

The points plotted in a Q–Q plot are always non-decreasing when viewed from left to right. If the two distributions being compared are identical, the Q–Q plot follows the 45° line . If the two distributions agree after linearly transforming the values in one of the distributions, then the Q–Q plot follows some line, but not necessarily the line . If the general trend of the Q–Q plot is flatter than the line , the distribution plotted on the horizontal axis is more dispersed than the distribution plotted on the vertical axis. Conversely, if the general trend of the Q–Q plot is steeper than the line , the distribution plotted on the vertical axis is more dispersed than the distribution plotted on the horizontal axis. Q–Q plots are often arced, or "S" shaped, indicating that one of the distributions is more skewed than the other, or that one of the distributions has heavier tails than the other. Although a Q–Q plot is based on quantiles, in a standard Q–Q plot it is not possible to determine which point in the Q–Q plot determines a given quantile. For example, it is not possible to determine the median of either of the two distributions being compared by inspecting the Q–Q plot. Some Q–Q plots indicate the deciles to make determinations such as this possible. The intercept and slope of a linear regression between the quantiles gives a measure of the relative location and relative scale of the samples. If the median of the distribution plotted on the horizontal axis is 0, the intercept of a regression line is a measure of location, and the slope is a measure of scale. The distance between medians is another measure of relative location reflected in a Q–Q plot. The " probability plot correlation coefficient" (PPCC plot) is the
correlation coefficient A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components ...
between the paired sample quantiles. The closer the correlation coefficient is to one, the closer the distributions are to being shifted, scaled versions of each other. For distributions with a single shape parameter, the probability plot correlation coefficient plot provides a method for estimating the shape parameter – one simply computes the correlation coefficient for different values of the shape parameter, and uses the one with the best fit, just as if one were comparing distributions of different types. Another common use of Q–Q plots is to compare the distribution of a sample to a theoretical distribution, such as the standard
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
, as in a
normal probability plot The normal probability plot is a graphical technique to identify substantive departures from normality. This includes identifying outliers, skewness, kurtosis, a need for transformations, and mixtures. Normal probability plots are made of raw ...
. As in the case when comparing two samples of data, one orders the data (formally, computes the order statistics), then plots them against certain quantiles of the theoretical distribution.


Plotting positions

The choice of quantiles from a theoretical distribution can depend upon context and purpose. One choice, given a sample of size , is for , as these are the quantiles that the sampling distribution realizes. The last of these, , corresponds to the 100th percentile – the maximum value of the theoretical distribution, which is sometimes infinite. Other choices are the use of , or instead to space the points evenly in the uniform distribution, using . Many other choices have been suggested, both formal and heuristic, based on theory or simulations relevant in context. The following subsections discuss some of these. A narrower question is choosing a maximum (estimation of a population maximum), known as the German tank problem, for which similar "sample maximum, plus a gap" solutions exist, most simply . A more formal application of this uniformization of spacing occurs in
maximum spacing estimation In statistics, maximum spacing estimation (MSE or MSP), or maximum product of spacing estimation (MPS), is a method for estimating the parameters of a univariate statistical model. The method requires maximization of the geometric mean of ''spac ...
of parameters.


Expected value of the order statistic for a uniform distribution

The approach equals that of plotting the points according to the probability that the last of () randomly drawn values will not exceed the -th smallest of the first randomly drawn values.


Expected value of the order statistic for a standard normal distribution

In using a
normal probability plot The normal probability plot is a graphical technique to identify substantive departures from normality. This includes identifying outliers, skewness, kurtosis, a need for transformations, and mixtures. Normal probability plots are made of raw ...
, the quantiles one uses are the rankits, the quantile of the expected value of the order statistic of a standard normal distribution. More generally, Shapiro–Wilk test uses the expected values of the order statistics of the given distribution; the resulting plot and line yields the
generalized least squares In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in a regression model. In these cases, ordinar ...
estimate for location and scale (from the intercept and
slope In mathematics, the slope or gradient of a line is a number that describes both the ''direction'' and the ''steepness'' of the line. Slope is often denoted by the letter ''m''; there is no clear answer to the question why the letter ''m'' is use ...
of the fitted line).Testing for Normality
by Henry C. Thode, CRC Press, 2002,
p. 31
/ref> Although this is not too important for the normal distribution (the location and scale are estimated by the mean and standard deviation, respectively), it can be useful for many other distributions. However, this requires calculating the expected values of the order statistic, which may be difficult if the distribution is not normal.


Median of the order statistics

Alternatively, one may use estimates of the ''
median In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic f ...
'' of the order statistics, which one can compute based on estimates of the median of the order statistics of a uniform distribution and the quantile function of the distribution; this was suggested by . This can be easily generated for any distribution for which the quantile function can be computed, but conversely the resulting estimates of location and scale are no longer precisely the least squares estimates, though these only differ significantly for small.


Heuristics

Several different formulas have been used or proposed as
affine Affine may describe any of various topics concerned with connections or affinities. It may refer to: * Affine, a relative by marriage in law and anthropology * Affine cipher, a special case of the more general substitution cipher * Affine comb ...
symmetrical Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...
plotting positions. Such formulas have the form for some value of in the range from 0 to 1, which gives a range between and . Expressions include: * * . * . * . * . * . * . * . * . * .. * . For large sample size, , there is little difference between these various expressions.


Filliben's estimate

The order statistic medians are the medians of the
order statistics In statistics, the ''k''th order statistic of a statistical sample is equal to its ''k''th-smallest value. Together with rank statistics, order statistics are among the most fundamental tools in non-parametric statistics and inference. Importa ...
of the distribution. These can be expressed in terms of the quantile function and the order statistic medians for the continuous uniform distribution by: : N(i) = G(U(i)) where are the uniform order statistic medians and is the quantile function for the desired distribution. The quantile function is the inverse of the
cumulative distribution function In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x. Eve ...
(probability that is less than or equal to some value). That is, given a probability, we want the corresponding quantile of the cumulative distribution function. James J. Filliben uses the following estimates for the uniform order statistic medians: : m(i) = \begin 1 - 0.5^ & i = 1\\ \\ \dfrac & i = 2, 3, \ldots, n-1\\ \\ 0.5^ & i = n.\end The reason for this estimate is that the order statistic medians do not have a simple form.


See also

*
Empirical distribution function In statistics, an empirical distribution function (commonly also called an empirical Cumulative Distribution Function, eCDF) is the distribution function associated with the empirical measure of a sample. This cumulative distribution function ...
*
Probit In probability theory and statistics, the probit function is the quantile function associated with the standard normal distribution. It has applications in data analysis and machine learning, in particular exploratory statistical graphics and s ...
analysis was developed by
Chester Ittner Bliss Chester Ittner Bliss (February 1, 1899 – March 14, 1979) was primarily a biologist, who is best known for his contributions to statistics. He was born in Springfield, Ohio in 1899 and died in 1979. He was the first secretary of the International ...
in 1934.


Notes


References


Citations


Sources

* * * * Cleveland, W.S. (1994) ''The Elements of Graphing Data'', Hobart Press * * * Gnanadesikan, R. (1977) ''Methods for Statistical Analysis of Multivariate Observations'', Wiley . *


External links


Probability plot
* Alternate description of the QQ-Plot: http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#qqplot {{DEFAULTSORT:Q-Q Plot Statistical charts and diagrams