In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling; it thereby contrasts with traditional hypothesis testing, in which a model is supposed to be selected before the data are seen. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.


Overview

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data." Exploratory data analysis is a technique for analyzing and investigating a data set and summarizing its main characteristics; a main advantage of EDA is that it provides visualizations of the data after conducting analysis.

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles. He favored these because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation. Moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation).

The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems). Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems, such as the fabrication of semiconductors and the understanding of communications networks, both of which were of interest to Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.
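The five-number summary and its robustness can be illustrated with a short sketch using only the Python standard library (the data and the helper name `five_number_summary` are our own, for illustration):

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    q1, med, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, med, q3, max(data)

values = [2, 4, 4, 5, 6, 7, 9]
print(five_number_summary(values))

# Unlike the mean, the median barely moves when a single
# extreme outlier is appended to the data.
with_outlier = values + [100]
print(statistics.median(values), statistics.mean(values))
print(statistics.median(with_outlier), statistics.mean(with_outlier))
```

Here the outlier drags the mean far above every "typical" observation, while the median shifts only slightly — the sense in which Tukey's summaries are robust.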


Development

John W. Tukey wrote the book ''Exploratory Data Analysis'' in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias, owing to the issues inherent in testing hypotheses suggested by the data.

The objectives of EDA are to:
*Enable unexpected discoveries in the data
*Suggest hypotheses about the causes of observed phenomena
*Assess the assumptions on which statistical inference will be based
*Support the selection of appropriate statistical tools and techniques
*Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted into data mining. They are also being taught to young students as a way to introduce them to statistical thinking.


Techniques and tools

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.

Typical graphical techniques used in EDA are:
* Box plot
* Histogram
* Multi-vari chart
* Run chart
* Pareto chart
* Scatter plot (2D/3D)
* Stem-and-leaf plot
* Parallel coordinates
* Odds ratio
* Targeted projection pursuit
* Heat map
* Bar chart
* Horizon graph
* Glyph-based visualization methods such as PhenoPlot and Chernoff faces
* Projection methods such as grand tour, guided tour and manual tour
* Interactive versions of these plots

Dimensionality reduction:
* Multidimensional scaling
* Principal component analysis (PCA)
* Multilinear PCA
* Nonlinear dimensionality reduction (NLDR)
* Iconography of correlations

Typical quantitative techniques are:
* Median polish
* Trimean
* Ordination
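Some of these quantitative summaries are simple to compute. Tukey's trimean, for instance, is the weighted average (Q1 + 2 × median + Q3) / 4 of the quartiles and the median. A minimal sketch (the function name is our own):

```python
import statistics

def trimean(data):
    """Tukey's trimean: (Q1 + 2*median + Q3) / 4, a robust
    measure of location built from order statistics."""
    q1, med, q3 = statistics.quantiles(data, n=4)
    return (q1 + 2 * med + q3) / 4

print(trimean([2, 4, 4, 5, 6, 7, 9]))
```

Because it depends only on the quartiles and median, the trimean inherits their insensitivity to a few extreme observations.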


History

Many EDA ideas can be traced back to earlier authors, for example:
* Francis Galton emphasized order statistics and quantiles.
* Arthur Lyon Bowley used precursors of the stemplot and five-number summary (Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median; see his ''Elementary Manual of Statistics'' (3rd edn., 1920), p. 62, where he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions").
* Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).

The Open University course ''Statistics in Society'' (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.


Example

Findings from EDA are often orthogonal to the primary analysis task. To illustrate, consider an example from Cook et al. where the analysis task is to find the variables which best predict the tip that a dining party will give to the waiter (Cook, D. and Swayne, D.F., ''Interactive and Dynamic Graphics for Data Analysis: With R and GGobi'', Springer, 2007, ISBN 978-0387717616). The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party.

The primary analysis task is approached by fitting a regression model where the tip rate is the response variable. The fitted model is

: (tip rate) = 0.18 - 0.01 × (party size)

which says that as the size of the dining party increases by one person (leading to a higher bill), the tip rate will decrease by 1%, on average. However, exploring the data reveals other interesting features not described by this model:

* A histogram of tip amounts with bins covering $1 increments shows that the distribution of values is skewed right and unimodal, as is common in distributions of small, non-negative quantities.
* A histogram with bins covering $0.10 increments reveals an interesting phenomenon: peaks occur at the whole-dollar and half-dollar amounts, caused by customers picking round numbers as tips. This behavior is common to other types of purchases too, such as gasoline.
* A scatterplot of tips vs. bill shows, instead of the tight positive linear association one might expect, variation that increases with tip amount; points below the fitted line correspond to tips lower than expected for that bill amount, and points above it to tips higher than expected. In particular, there are more points far from the line in the lower right than in the upper left, indicating that more customers are very cheap than very generous.
* A scatterplot of tips vs. bill separated by payer gender and smoking section status shows that smoking parties have much more variability in the tips that they give, that males tend to pay the (few) higher bills, and that the female non-smokers tend to be very consistent tippers (with three conspicuous exceptions shown in the sample).

What is learned from the plots is different from what is illustrated by the regression model, even though the experiment was not designed to investigate any of these other trends. The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data.
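A linear fit of this form can be reproduced with ordinary least squares. The sketch below uses synthetic data generated to roughly match the fitted model above — it is not the actual tips data set from Cook et al.:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic illustration: party sizes 1-6, tip rate ~ 0.18 - 0.01*size + noise
party_size = rng.integers(1, 7, size=200)
tip_rate = 0.18 - 0.01 * party_size + rng.normal(0, 0.02, size=200)

# Ordinary least squares with design matrix [1, party_size]
X = np.column_stack([np.ones_like(party_size, dtype=float), party_size])
coef, *_ = np.linalg.lstsq(X, tip_rate, rcond=None)
intercept, slope = coef
print(f"tip rate = {intercept:.2f} {slope:+.3f} * (party size)")
```

The recovered intercept and slope land close to the generating values of 0.18 and -0.01, but — as the EDA above illustrates — such a fit summarizes only one aspect of the data.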


Software

* JMP, an EDA package from SAS Institute.
* KNIME, the Konstanz Information Miner, an open-source data exploration platform based on Eclipse.
* Minitab, an EDA and general statistics package widely used in industrial and corporate settings.
* Orange, an open-source data mining and machine learning software suite.
* Python, an open-source programming language widely used in data mining and machine learning, with libraries such as Matplotlib and Seaborn commonly used for EDA and data visualization.
* R, an open-source programming language for statistical computing and graphics; together with Python, one of the most popular languages for data science.
* TinkerPlots, an EDA software for upper elementary and middle school students.
* Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.


See also

* Anscombe's quartet, on the importance of exploration
* Data dredging
* Predictive analytics
* Structured data analysis (statistics)
* Configural frequency analysis
* Descriptive statistics


References


Bibliography

* Andrienko, N. & Andrienko, G. (2005). ''Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach''. Springer.
* Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007). ''Interactive and Dynamic Graphics for Data Analysis: With R and GGobi''. Springer. ISBN 9780387717616.
* Hoaglin, D. C.; Mosteller, F. & Tukey, J. W. (Eds.) (1985). ''Exploring Data Tables, Trends and Shapes''. ISBN 978-0-471-09776-1.
* Hoaglin, D. C.; Mosteller, F. & Tukey, J. W. (Eds.) (1983). ''Understanding Robust and Exploratory Data Analysis''. ISBN 978-0-471-09777-8.
* Young, F. W., Valero-Mora, P. and Friendly, M. (2006). ''Visual Statistics: Seeing Your Data with Dynamic Interactive Graphics''. Wiley. ISBN 978-0-471-68160-1.
* Jambu, M. (1991). ''Exploratory and Multivariate Data Analysis''. Academic Press. ISBN 0123800900.
* DuToit, S. H. C., Steyn, A. G. W. & Stumpf, R. H. (1986). ''Graphical Exploratory Data Analysis''. Springer. ISBN 978-1-4612-9371-2.
* Leinhardt, G. & Leinhardt, S. (1980). "Exploratory Data Analysis: New Tools for the Analysis of Empirical Data". ''Review of Research in Education'', Vol. 8, pp. 85–157.
* Theus, M. & Urbanek, S. (2008). ''Interactive Graphics for Data Analysis: Principles and Examples''. CRC Press, Boca Raton, FL.


External links


Carnegie Mellon University – free online course on Probability and Statistics, with a module on EDA

