In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
Overview
Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.
Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data, that is, the two extremes (maximum and minimum), the median, and the quartiles, because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
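The robustness claim can be made concrete with a small sketch. The function name `five_number_summary` and both samples are made up for illustration, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption; other conventions, such as Tukey's hinges, can give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # Quartile conventions differ between textbooks and packages; the
    # "inclusive" method is an assumption made for this sketch.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

# Hypothetical samples: the second differs only by one gross outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
contaminated = [1, 2, 3, 4, 5, 6, 7, 8, 900]

print(five_number_summary(clean))
print(five_number_summary(contaminated))
print(statistics.mean(clean), statistics.mean(contaminated))
```

In this example the outlier changes only the maximum of the five-number summary, while the mean moves far away from the bulk of the data.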
Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.
Development
John W. Tukey wrote the book ''Exploratory Data Analysis'' in 1977.
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
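A small simulation can illustrate the bias that arises when a hypothesis is tested on the same data that suggested it. This is only a sketch under stated assumptions: the function names, the sample size, the number of candidate predictors and the use of pure Gaussian noise are all choices made for illustration, not part of Tukey's argument.

```python
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def selection_bias_demo(trials=200, n=50, n_predictors=20, seed=0):
    """Average |r| of the best-looking predictor: same data vs. fresh data."""
    rng = random.Random(seed)
    same_data, fresh_data = [], []
    for _ in range(trials):
        # Exploration sample: pure noise, so no predictor is truly related to y.
        y = [rng.gauss(0, 1) for _ in range(n)]
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_predictors)]
        # "Hypothesis suggested by the data": keep the strongest correlation.
        same_data.append(max(abs(pearson_r(x, y)) for x in xs))
        # Confirmation sample: new noise standing in for a fresh measurement
        # of the selected predictor and the response.
        y_new = [rng.gauss(0, 1) for _ in range(n)]
        x_new = [rng.gauss(0, 1) for _ in range(n)]
        fresh_data.append(abs(pearson_r(x_new, y_new)))
    return sum(same_data) / trials, sum(fresh_data) / trials

same, fresh = selection_bias_demo()
print(f"mean |r| on the data that suggested the hypothesis: {same:.2f}")
print(f"mean |r| on newly collected data:                   {fresh:.2f}")
```

Because the strongest of many chance correlations is selected, the apparent effect on the exploration data is systematically larger than on new data, which is why hypotheses found by exploration are better confirmed on data collected afterwards.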
The objectives of EDA are to:
*Enable unexpected discoveries in the data
*Suggest hypotheses about the causes of observed phenomena
*Assess assumptions on which statistical inference will be based
*Support the selection of appropriate statistical tools and techniques
*Provide a basis for further data collection through surveys or experiments
Many EDA techniques have been adopted into data mining. They are also being taught to young students as a way to introduce them to statistical thinking.
Techniques and tools
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.
Typical graphical techniques used in EDA are:
*Box plot
*Histogram
*Multi-vari chart
*Run chart
*Pareto chart
*Scatter plot (2D/3D)
*Stem-and-leaf plot
*Parallel coordinates
*Odds ratio
*Targeted projection pursuit
*Heat map
*Bar chart
*Horizon graph
*Glyph-based visualization methods such as PhenoPlot and Chernoff faces
*Projection methods such as grand tour, guided tour and manual tour
*Interactive versions of these plots
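Of the displays listed above, the stem-and-leaf plot is simple enough to sketch as plain text. The following is a minimal illustration, assuming non-negative integers with the last digit used as the leaf; the function name `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers.

    The stem is the value with its last digit removed and the leaf is that
    last digit; this unit choice is an assumption made for the sketch.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {digits}")
    return lines

# Hypothetical sample values, made up for illustration only.
sample = [12, 15, 21, 23, 23, 27, 30, 34, 58]
print("\n".join(stem_and_leaf(sample)))
```

Unlike a histogram, the display keeps every individual value readable while still showing the overall shape of the distribution, including the empty stem between the 30s and the 50s in this sample.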
Dimensionality reduction:
*Multidimensional scaling
*Principal component analysis (PCA)
*Multilinear PCA
*Nonlinear dimensionality reduction (NLDR)
*Iconography of correlations
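As a rough illustration of the idea behind principal component analysis, the sketch below finds the direction of greatest variance in a small data set. It uses power iteration on the sample covariance matrix purely to stay within the Python standard library; this is an assumption of the example, not the method used by any particular package, and practical work would normally rely on a numerical library. The function name and the data points are hypothetical.

```python
def first_principal_component(rows, iterations=200):
    """Approximate the leading principal axis of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary starting direction; it must not be orthogonal to the answer.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Hypothetical points lying exactly on the line y = 2x.
points = [(0, 0), (1, 2), (2, 4), (3, 6)]
axis = first_principal_component(points)
print(axis)  # a unit vector along the direction (1, 2)
```

Projecting each centered observation onto this axis would reduce the two variables to a single coordinate while keeping all of the variation in this particular example.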
Typical quantitative techniques are:
*Median polish
*Trimean
*Ordination
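Two of the listed techniques, the trimean and median polish, can be sketched in a few lines. The trimean is the weighted average (Q1 + 2 × median + Q3) / 4, and median polish fits an additive model (overall effect plus row and column effects) to a two-way table by repeatedly sweeping out medians. The function names, the fixed number of sweeps, the quartile convention and the example table are assumptions made for illustration.

```python
import statistics

def trimean(values):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    # The "inclusive" quartile convention is an assumption of this sketch.
    q1, med, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return (q1 + 2 * med + q3) / 4

def median_polish(table, sweeps=10):
    """Fit table[i][j] ~ overall + row[i] + col[j] by sweeping out medians."""
    resid = [list(map(float, r)) for r in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(sweeps):  # fixed sweep count instead of a convergence test
        # Row sweep: move each row's median from the residuals to the row effect.
        for i in range(n_rows):
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Column sweep: the same step applied to the columns.
        for j in range(n_cols):
            m = statistics.median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

# Hypothetical, exactly additive table: 10 + row effect + column effect.
table = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
print(trimean([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

For a table that is exactly additive, as here, the residuals are all zero after polishing; with real data, large residuals point to cells that the additive description does not explain.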
History
Many EDA ideas can be traced back to earlier authors, for example:
*Francis Galton emphasized order statistics and quantiles.
*Arthur Lyon Bowley used precursors of the stemplot and five-number summary (Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median; see his ''Elementary Manual of Statistics'' (3rd edn., 1920), p. 62, where he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions").
*Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).
The Open University course ''Statistics in Society'' (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.
Example
Findings from EDA are orthogonal to the primary analysis task. To illustrate, consider an example from Cook et al. where the analysis task is to find the variables which best predict the tip that a dining party will give to the waiter. [Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007) ''Interactive and Dynamic Graphics for Data Analysis: With R and GGobi'', Springer, ISBN 978-0387717616] The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The primary analysis task is approached by fitting a regression model where the tip rate is the response variable. The fitted model is
: (tip rate) = 0.18 - 0.01 × (party size)
which says that as the size of the dining party increases by one person (leading to a higher bill), the tip rate will decrease by 0.01 (one percentage point), on average.
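The fitted model quoted above can be evaluated directly. The sketch below only restates that equation; the function name `predicted_tip_rate` and the range of party sizes are chosen for illustration.

```python
def predicted_tip_rate(party_size):
    """Tip rate predicted by the fitted model quoted above."""
    return 0.18 - 0.01 * party_size

for size in range(1, 7):
    rate = predicted_tip_rate(size)
    print(f"party of {size}: predicted tip rate {rate:.2f}")
```

Each additional diner lowers the predicted rate by the same 0.01, which is all the model says; it is silent about the spread of individual tips around that average.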
However, exploring the data reveals other interesting features not described by this model.
Tips-hist1.png, Histogram of tip amounts where the bins cover $1 increments. The distribution of values is skewed right and unimodal, as is common in distributions of small, non-negative quantities.
Tips-hist2.png, Histogram of tip amounts where the bins cover $0.10 increments. An interesting phenomenon is visible: peaks occur at the whole-dollar and half-dollar amounts, which is caused by customers picking round numbers as tips. This behavior is common to other types of purchases too, like gasoline.
Tips-scat1.png, Scatterplot of tips vs. bill. Points below the line correspond to tips that are lower than expected (for that bill amount), and points above the line are higher than expected. We might expect to see a tight, positive linear association, but instead see variation that increases with tip amount. In particular, there are more points far away from the line in the lower right than in the upper left, indicating that more customers are very cheap than very generous.
Tips-scat2.png, Scatterplot of tips vs. bill separated by payer gender and smoking section status. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and the female non-smokers tend to be very consistent tippers (with three conspicuous exceptions shown in the sample).
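The effect of bin width described in the two histogram captions can be reproduced with a text histogram. The tip amounts below are invented for illustration and are not the data analysed by Cook et al.; the function name `bin_counts` is likewise hypothetical. Amounts are kept in cents so that binning uses exact integer arithmetic.

```python
from collections import Counter

def bin_counts(amounts_cents, width_cents):
    """Count values per bin; each bin is labelled by its lower edge in cents."""
    return Counter((a // width_cents) * width_cents for a in amounts_cents)

# Hypothetical tip amounts in cents, invented for illustration only
# (they are NOT the data set analysed by Cook et al.).
tips = [100, 150, 150, 173, 200, 200, 200, 224,
        250, 300, 300, 300, 318, 350, 400, 500]

for width in (100, 10):
    print(f"bin width ${width / 100:.2f}")
    for edge, count in sorted(bin_counts(tips, width).items()):
        print(f"  {edge / 100:5.2f} | {'#' * count}")
```

With the wide bins the made-up sample looks like a single smooth hump, while the narrow bins separate the round amounts from the values in between, which is the kind of feature the second histogram makes visible.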
What is learned from the plots is different from what is illustrated by the regression model, even though the experiment was not designed to investigate any of these other trends. The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data.
Software
*JMP, an EDA package from SAS Institute.
*KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
*Minitab, an EDA and general statistics package widely used in industrial and corporate settings.
*Orange, an open-source data mining and machine learning software suite.
*Python, an open-source programming language widely used in data mining and machine learning.
*R, an open-source programming language for statistical computing and graphics. Together with Python, it is one of the most popular languages for data science.
*TinkerPlots, an EDA software package for upper elementary and middle school students.
*Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.
See also
*Anscombe's quartet, on importance of exploration
*Data dredging
*Predictive analytics
*Structured data analysis (statistics)
*Configural frequency analysis
*Descriptive statistics
References
Bibliography
*Andrienko, N & Andrienko, G (2005) ''Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach''. Springer. ISBN 3-540-25994-5
*Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007) ''Interactive and Dynamic Graphics for Data Analysis: With R and GGobi''. Springer. ISBN 9780387717616
*Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985) ''Exploring Data Tables, Trends and Shapes''. ISBN 978-0-471-09776-1
*Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983) ''Understanding Robust and Exploratory Data Analysis''. ISBN 978-0-471-09777-8
*Leinhardt, G., Leinhardt, S. (1980) 'Exploratory Data Analysis: New Tools for the Analysis of Empirical Data', Review of Research in Education, Vol. 8, pp. 85–157.
*Theus, M., Urbanek, S. (2008) ''Interactive Graphics for Data Analysis: Principles and Examples''. CRC Press, Boca Raton, FL
*Young, F. W., Valero-Mora, P. and Friendly, M. (2006) ''Visual Statistics: Seeing your data with Dynamic Interactive Graphics''. Wiley. ISBN 978-0-471-68160-1
*Jambu, M. (1991) ''Exploratory and Multivariate Data Analysis''. Academic Press. ISBN 0123800900
*DuToit, S. H. C., Steyn, A. G. W., Stumpf, R. H. (1986) ''Graphical Exploratory Data Analysis''. Springer. ISBN 978-1-4612-9371-2
External links
Carnegie Mellon University – free online course on Probability and Statistics, with a module on EDA