Boxplot

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles. In addition to the box on a box plot, there can be lines (called ''whiskers'') extending from the box that indicate variability outside the upper and lower quartiles; the plot is therefore also called a box-and-whisker plot or box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length). The spacings in each subsection of the box plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.


History

The range-bar method was first introduced by Mary Eleanor Spear in her book "Charting Statistics" in 1952 and again in her book "Practical Charting Techniques" in 1969. The box-and-whisker plot was first introduced in 1970 by John Tukey, who later published on the subject in his book "Exploratory Data Analysis" in 1977.


Elements

A boxplot is a standardized way of displaying a dataset based on the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

* Minimum (''Q''0 or 0th percentile): the lowest data point in the data set excluding any outliers
* Maximum (''Q''4 or 100th percentile): the highest data point in the data set excluding any outliers
* Median (''Q''2 or 50th percentile): the middle value in the data set
* First quartile (''Q''1 or 25th percentile): also known as the ''lower quartile'' q_n(0.25), it is the median of the lower half of the dataset
* Third quartile (''Q''3 or 75th percentile): also known as the ''upper quartile'' q_n(0.75), it is the median of the upper half of the dataset

In addition to the minimum and maximum values used to construct a box plot, another important element is the interquartile range (IQR):

* Interquartile range (IQR): the distance between the upper and lower quartiles
: \text{IQR} = Q_3 - Q_1 = q_n(0.75) - q_n(0.25)

A box plot usually includes two parts, a box and a set of whiskers, as shown in Figure 2. The box is drawn from ''Q''1 to ''Q''3 with a horizontal line drawn inside it to denote the median. The whiskers can be defined in various ways.

In the most straightforward method, the boundary of the lower whisker is the minimum value of the data set, and the boundary of the upper whisker is the maximum value of the data set.

Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile (''Q''3), a distance of 1.5 times the IQR is measured out and a whisker is drawn ''up to'' the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile (''Q''1) and a whisker is drawn ''down to'' the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as ''outliers''. The outliers can be plotted on the box plot as a dot, a small circle, a star, ''etc.''

However, the whiskers can stand for several other things, such as:

* The minimum and the maximum value of the data set (as shown in Figure 2)
* One standard deviation above and below the mean of the data set
* The 9th percentile and the 91st percentile of the data set
* The 2nd percentile and the 98th percentile of the data set

Rarely, a box plot is drawn without whiskers. Some box plots include an additional character to represent the mean of the data. The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.

Because of this variability, it is appropriate to describe the convention that is being used for the whiskers and outliers in the caption of the box plot.
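Because the whisker convention changes what the plot shows, it can help to see a few conventions side by side. The following is a minimal sketch, not a reference implementation: the function name `whisker_ends` and the convention labels are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default "exclusive" method, which is only one of several quartile definitions in use.

```python
import statistics


def whisker_ends(data, convention="tukey"):
    """Return (low, high) whisker ends under one of three conventions.

    The convention names are illustrative labels, not standard terms.
    """
    xs = sorted(data)
    if convention == "minmax":
        # Whiskers reach the smallest and largest observations.
        return xs[0], xs[-1]
    if convention == "stdev":
        # Whiskers sit one sample standard deviation around the mean.
        mean = statistics.mean(xs)
        sd = statistics.stdev(xs)
        return mean - sd, mean + sd
    if convention == "tukey":
        # Whiskers end at the most extreme observations that lie
        # within 1.5 IQR of the quartiles.
        q1, _, q3 = statistics.quantiles(xs, n=4)
        reach = 1.5 * (q3 - q1)
        low = min(x for x in xs if x >= q1 - reach)
        high = max(x for x in xs if x <= q3 + reach)
        return low, high
    raise ValueError(f"unknown convention: {convention!r}")


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(whisker_ends(sample, "minmax"))  # (1, 100)
print(whisker_ends(sample, "tukey"))   # (1, 9)
```

With the 1.5 IQR convention the value 100 falls outside the upper whisker and would be drawn as an individual point, whereas the min/max convention absorbs it into the whisker.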


Variations

Since the mathematician John W. Tukey first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed. The two most commonly found variations are the variable width box plots and the notched box plots shown in Figure 4.

Variable width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.

Notched box plots apply a "notch" or narrowing of the box around the median. Notches offer a rough guide to the significance of the difference between medians: if the notches of two boxes do not overlap, this provides evidence of a statistically significant difference between the medians. The width of the notches is proportional to the interquartile range (IQR) of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples). One convention for obtaining the boundaries of these notches is to use a distance of \pm \frac{1.58 \text{IQR}}{\sqrt{n}} around the median.

Adjusted box plots are intended to describe skewed distributions, and they rely on the medcouple statistic of skewness. For a medcouple value of MC, the lengths of the upper and lower whiskers on the box plot are respectively defined to be:
: \begin{cases} 1.5 \text{IQR} \cdot e^{3 \text{MC}}, & 1.5 \text{IQR} \cdot e^{-4 \text{MC}} \text{ if } \text{MC} \geq 0, \\ 1.5 \text{IQR} \cdot e^{4 \text{MC}}, & 1.5 \text{IQR} \cdot e^{-3 \text{MC}} \text{ if } \text{MC} \leq 0. \end{cases}
For a symmetrical data distribution, the medcouple will be zero, and this reduces the adjusted box plot to Tukey's box plot with equal whisker lengths of 1.5 \text{IQR} for both whiskers.

Other kinds of box plots, such as violin plots and bean plots, can show the difference between single-modal and multimodal distributions, which cannot be observed from the original classical box plot.
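The notch and adjusted-whisker rules described in this section reduce to short formulas. The sketch below assumes the commonly cited 1.58 multiplier for notches and the exponents 3 and 4 of the adjusted box plot; the function names are hypothetical, and the medcouple is taken as an input rather than computed, since its calculation is outside the scope of this section.

```python
import math


def notch_bounds(median, iqr, n):
    """Notch limits around the median: median +/- 1.58 * IQR / sqrt(n)."""
    half_width = 1.58 * iqr / math.sqrt(n)
    return median - half_width, median + half_width


def adjusted_whisker_lengths(iqr, mc):
    """Return (upper, lower) maximum whisker lengths for medcouple mc."""
    if mc >= 0:
        # Right-skewed data: longer upper whisker, shorter lower whisker.
        return 1.5 * iqr * math.exp(3 * mc), 1.5 * iqr * math.exp(-4 * mc)
    # Left-skewed data: shorter upper whisker, longer lower whisker.
    return 1.5 * iqr * math.exp(4 * mc), 1.5 * iqr * math.exp(-3 * mc)


print(notch_bounds(70, 9, 24))
print(adjusted_whisker_lengths(9, 0.0))  # (13.5, 13.5), same as Tukey's rule
print(adjusted_whisker_lengths(9, 0.2))
```

A medcouple of zero reproduces the symmetric 1.5 IQR whiskers, while a positive medcouple lengthens the upper whisker and shortens the lower one.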


Examples


Example without outliers

A series of hourly temperatures were measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

A box plot of the data set can be generated by first calculating five relevant values of this data set: minimum, maximum, median (''Q''2), first quartile (''Q''1), and third quartile (''Q''3).

The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature is 57 °F.

The maximum is the largest number of the data set. In this case, the maximum recorded day temperature is 81 °F.

The median is the "middle" number of the ordered data set. This means that about 50% of the elements are less than the median and about 50% of the elements are greater than the median. The median of this ordered data set is 70 °F.

The first quartile value (''Q''1 or 25th percentile) is the number that marks one quarter of the ordered data set. In other words, about 25% of the elements are less than the first quartile and about 75% of the elements are greater than it. The first quartile value can be easily determined by finding the "middle" number of the ordered values between the minimum and the median. For the hourly temperatures, the "middle" number found between 57 °F and 70 °F is 66 °F.

The third quartile value (''Q''3 or 75th percentile) is the number that marks three quarters of the ordered data set. In other words, about 75% of the elements are less than the third quartile and about 25% of the elements are greater than it. The third quartile value can be easily obtained by finding the "middle" number of the ordered values between the median and the maximum. For the hourly temperatures, the "middle" number between 70 °F and 81 °F is 75 °F.

The interquartile range, or IQR, can be calculated by subtracting the first quartile value (''Q''1) from the third quartile value (''Q''3):
: \text{IQR} = Q_3 - Q_1 = 75^\circ\text{F} - 66^\circ\text{F} = 9^\circ\text{F}.
Hence, 1.5 \text{IQR} = 1.5 \cdot 9^\circ\text{F} = 13.5^\circ\text{F}.

1.5 IQR above the third quartile is:
: Q_3 + 1.5 \text{IQR} = 75^\circ\text{F} + 13.5^\circ\text{F} = 88.5^\circ\text{F}.
1.5 IQR below the first quartile is:
: Q_1 - 1.5 \text{IQR} = 66^\circ\text{F} - 13.5^\circ\text{F} = 52.5^\circ\text{F}.

The upper whisker boundary of the box plot is the largest data value that is within 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5 °F and the maximum is 81 °F. Therefore, the upper whisker is drawn at the value of the maximum, which is 81 °F.

Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5 °F and the minimum is 57 °F. Therefore, the lower whisker is drawn at the value of the minimum, which is 57 °F.
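The arithmetic of this example can be checked with a few lines of Python. This is a small sketch that assumes the 24 temperatures 57, 57, 57, 58, 63, ..., 79, 81 and uses `statistics.quantiles` with its default "exclusive" method; other quartile definitions may give slightly different values on other data sets.

```python
import statistics

# Hourly temperatures in degrees Fahrenheit, already in ascending order.
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

# The three cut points that split the data into quarters.
q1, median, q3 = statistics.quantiles(temps, n=4)
iqr = q3 - q1

# Limits within which the whiskers must end.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(min(temps), q1, median, q3, max(temps))  # 57 66.0 70.0 75.0 81
print(lower_limit, upper_limit)                # 52.5 88.5
```

Both the minimum (57) and the maximum (81) lie inside the limits, so the whiskers end at those two values and no point is drawn as an outlier.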


Example with outliers

Above is an example without outliers. Here is a follow-up example for generating a box plot with outliers:

The ordered set for the recorded temperatures is (°F): 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.

In this example, only the first and the last number are changed. The median, third quartile, and first quartile remain the same.

In this case, the maximum value in this data set is 89 °F, and 1.5 IQR above the third quartile is 88.5 °F. The maximum is greater than the third quartile plus 1.5 IQR, so the maximum is an outlier. Therefore, the upper whisker is drawn at the greatest value smaller than 1.5 IQR above the third quartile, which is 79 °F.

Similarly, the minimum value in this data set is 52 °F, and 1.5 IQR below the first quartile is 52.5 °F. The minimum is smaller than the first quartile minus 1.5 IQR, so the minimum is also an outlier. Therefore, the lower whisker is drawn at the smallest value greater than 1.5 IQR below the first quartile, which is 57 °F.
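The outlier rule in this example can be expressed as a simple filter. The sketch below takes the quartiles 66 °F and 75 °F as given by the text rather than recomputing them; the variable names are illustrative only.

```python
# Temperatures with the first and last values changed to 52 and 89.
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

Q1, Q3 = 66, 75          # quartiles as stated in the example
reach = 1.5 * (Q3 - Q1)  # 1.5 IQR = 13.5

lower_limit = Q1 - reach  # 52.5
upper_limit = Q3 + reach  # 88.5

# Points beyond the limits are drawn individually as outliers.
outliers = [t for t in temps if t < lower_limit or t > upper_limit]

# Whiskers end at the most extreme points that are not outliers.
inside = [t for t in temps if lower_limit <= t <= upper_limit]
lower_whisker, upper_whisker = min(inside), max(inside)

print(outliers)                      # [52, 89]
print(lower_whisker, upper_whisker)  # 57 79
```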


In the case of large datasets

An additional example shows how to obtain a box plot from a data set containing a large number of data points:


General equation to compute empirical quantiles

: q_n(p) = x_{(k)} + \alpha(x_{(k+1)} - x_{(k)})
: \text{with } k = \lfloor p(n+1) \rfloor \text{ and } \alpha = p(n+1) - k

Here x_{(k)} stands for the k-th value in the general ordering of the data points (i.e. if i < k, then x_{(i)} \le x_{(k)}).

Using the above example that has 24 data points (''n'' = 24), one can calculate the median, first and third quartile either mathematically or visually.

Median
: q_n(0.5) = x_{(12)} + (0.5\cdot25-12)\cdot(x_{(13)}-x_{(12)}) = 70+(0.5\cdot25-12)\cdot(70-70) = 70^\circ\text{F}

First quartile
: q_n(0.25) = x_{(6)} + (0.25\cdot25-6)\cdot(x_{(7)}-x_{(6)}) = 66+(0.25\cdot25-6)\cdot(66-66) = 66^\circ\text{F}

Third quartile
: q_n(0.75) = x_{(18)} + (0.75\cdot25-18)\cdot(x_{(19)}-x_{(18)}) = 75+(0.75\cdot25-18)\cdot(75-75) = 75^\circ\text{F}
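The quantile formula of this section translates directly into code. The following is a minimal sketch under stated assumptions: the function name `empirical_quantile` is hypothetical, k is taken as the integer part of p(n+1), and positions that fall before the first or after the last observation are clamped to the extremes, a boundary choice the text does not specify.

```python
import math


def empirical_quantile(data, p):
    """q_n(p) = x_(k) + alpha * (x_(k+1) - x_(k)), with k = floor(p(n+1))."""
    xs = sorted(data)
    n = len(xs)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    # Assumption: clamp to the extremes when k falls outside 1..n-1.
    if k < 1:
        return xs[0]
    if k >= n:
        return xs[-1]
    # xs is 0-indexed, so x_(k) is xs[k - 1] and x_(k+1) is xs[k].
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])


temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

print(empirical_quantile(temps, 0.25))  # 66.0
print(empirical_quantile(temps, 0.5))   # 70.0
print(empirical_quantile(temps, 0.75))  # 75.0
```

Because neighbouring order statistics are equal at all three positions in this data set, the interpolation term vanishes; on data without such ties the fractional part alpha would shift the result between the two neighbouring values.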


Visualization

Although box plots may seem more primitive than histograms or kernel density estimates, they do have a number of advantages. First, the box plot enables statisticians to do a quick graphical examination of one or more data sets. Box plots also take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data in parallel (see Figure 1 for an example). Lastly, the overall structure of histograms and kernel density estimates can be strongly influenced by the choice of the number and width of bins and by the choice of bandwidth, respectively, whereas a box plot involves no binning or bandwidth choice.

Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0,''σ''²) distribution and observe their characteristics directly (as shown in Figure 7).
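The comparison with a normal distribution can also be made numerically. The sketch below uses `statistics.NormalDist` from the Python standard library to locate the quartiles and the 1.5 IQR limits of a standard normal distribution (σ = 1, chosen here for illustration) and to compute how much probability lies between those limits.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of the standard normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Limits of the 1.5 IQR whisker convention.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Probability mass between the two limits.
coverage = std_normal.cdf(upper_limit) - std_normal.cdf(lower_limit)

print(f"Q3 = {q3:.4f} sigma, IQR = {iqr:.4f} sigma")
print(f"upper limit = {upper_limit:.4f} sigma")
print(f"coverage = {coverage:.4%}")
```

For normally distributed data the limits sit near ±2.7σ and enclose roughly 99.3% of the probability, so under this convention only about 0.7% of observations would be expected to appear as outliers.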


See also

* Bagplot
* Candlestick chart
* Exploratory data analysis
* Fan chart
* Five-number summary
* Functional boxplot
* Seven-number summary




External links


* Beeswarm Boxplot - superimposing a frequency-jittered stripchart on top of a box plot