HOME

TheInfoList



OR:

In statistics, exploratory data analysis (EDA) is an approach of
analyzing Analysis (plural, : analyses) is the process of breaking a complexity, complex topic or Substance theory, substance into smaller parts in order to gain a better understanding of it. The technique has been applied in the study of mathematics a ...
data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the ...
s to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.


Overview

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data." Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at
Bell Labs Nokia Bell Labs, originally named Bell Telephone Laboratories (1925–1984), then AT&T Bell Laboratories (1984–1996) and Bell Labs Innovations (1996–2007), is an American industrial Research and development, research and scientific developm ...
. The S programming language inspired the systems
S-PLUS S-PLUS is a commercial implementation of the S programming language sold by TIBCO Software Inc. It features object-oriented programming capabilities and advanced analytical algorithms. Due to the increasing popularity of the open source S succ ...
and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and
patterns A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated li ...
in data that merited further study. Tukey's EDA was related to two other developments in
statistical theory The theory of statistics provides a basis for the whole range of techniques, in both study design and data analysis, that are used within applications of statistics. The theory covers approaches to statistical-decision problems and to statistica ...
:
robust statistics Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, su ...
and
nonparametric statistics Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based on either being dist ...
, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of five number summary of numerical data—the two extremes (
maximum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given r ...
and
minimum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given r ...
), the median, and the quartiles—because these median and quartiles, being functions of the empirical distribution are defined for all distributions, unlike the
mean There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the '' ari ...
and standard deviation; moreover, the quartiles and median are more robust to
skewed In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimoda ...
or
heavy-tailed distribution In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. In many applications it is the right tail of the distr ...
s than traditional summaries (the mean and standard deviation). The packages S,
S-PLUS S-PLUS is a commercial implementation of the S programming language sold by TIBCO Software Inc. It features object-oriented programming capabilities and advanced analytical algorithms. Due to the increasing popularity of the open source S succ ...
, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and
Efron Efron is a Jewish surname. It is taken from the Biblical place name, he, עפרון. Another version to it is the demonym Efroni (). Notable people with the surname include: * Bradley Efron (born 1938), American statistician * Edith Efron (1922� ...
bootstrap, which are nonparametric and robust (for many problems). Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on
exponential families In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate ...
.


Development

John W. Tukey John Wilder Tukey (; June 16, 1915 – July 26, 2000) was an American mathematician and statistician, best known for the development of the fast Fourier Transform (FFT) algorithm and box plot. The Tukey range test, the Tukey lambda distribut ...
wrote the book ''Exploratory Data Analysis'' in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using
data In the pursuit of knowledge, data (; ) is a collection of discrete Value_(semiotics), values that convey information, describing quantity, qualitative property, quality, fact, statistics, other basic units of meaning, or simply sequences of sy ...
to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to
systematic bias Systematic may refer to: Science * Short for systematic error * Systematic fault * Systematic bias, errors that are not determined by chance but are introduced by an inaccuracy (involving either the observation or measurement process) inheren ...
owing to the issues inherent in
testing hypotheses suggested by the data In statistics, hypotheses suggested by a given dataset, when tested with the same dataset that suggested them, are likely to be accepted even when they are not true. This is because circular reasoning (double dipping) would be involved: somethin ...
. The objectives of EDA are to: *Enable unexpected discoveries in the data *Suggest hypotheses about the causes of observed phenomena *Assess assumptions on which statistical inference will be based *Support the selection of appropriate statistical tools and techniques *Provide a basis for further data collection through surveys or
experiments An experiment is a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs when ...
Many EDA techniques have been adopted into data mining. They are also being taught to young students as a way to introduce them to statistical thinking.


Techniques and tools

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques. Typical graphical techniques used in EDA are: *
Box plot In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness groups of numerical data through their quartiles. In addition to the box on a box plot, there can be lines (which are ca ...
* Histogram * Multi-vari chart *
Run chart A run chart, also known as a run-sequence plot is a graph that displays observed data in a time sequence. Often, the data displayed represent some aspect of the output or performance of a manufacturing or other business process. It is therefore ...
*
Pareto chart A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line. The chart is named for the Pareto principle, w ...
* Scatter plot (2D/3D) * Stem-and-leaf plot *
Parallel coordinates Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an ''n''-dimensional space, a backdrop is drawn consisting of ''n'' parallel lines, typically vertical and equally spaced. ...
*
Odds ratio An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B. The odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due ...
*
Targeted projection pursuit Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...
*
Heat map A heat map (or heatmap) is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is c ...
*
Bar chart A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart i ...
*Horizon graph *Glyph-based visualization methods such as PhenoPlot and Chernoff faces * Projection methods such as grand tour, guided tour and manual tour * Interactive versions of these plots
Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...
: *
Multidimensional scaling Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of n objects or individuals" into a configurati ...
* Principal component analysis (PCA) * Multilinear PCA * Nonlinear dimensionality reduction (NLDR) *
Iconography of correlations In exploratory data analysis, the iconography of correlations is a method which consists in replacing a correlation matrix by a diagram where the “remarkable” correlations are represented by a solid line (positive correlation), or a dotted line ...
Typical
quantitative Quantitative may refer to: * Quantitative research, scientific investigation of quantitative properties * Quantitative analysis (disambiguation) * Quantitative verse, a metrical system in poetry * Statistics, also known as quantitative analysis ...
techniques are: *
Median polish The median polish is a simple and robust exploratory data analysis procedure proposed by the statistician John Tukey. The purpose of median polish is to find an additively-fit model for data in a two-way layout table (usually, results from a factor ...
*
Trimean In statistics the trimean (TM), or Tukey's trimean, is a measure of a probability distribution's location defined as a weighted average of the distribution's median and its two quartiles: : TM= \frac This is equivalent to the average of the m ...
*
Ordination Ordination is the process by which individuals are consecrated, that is, set apart and elevated from the laity class to the clergy, who are thus then authorized (usually by the denominational hierarchy composed of other clergy) to perform v ...


History

Many EDA ideas can be traced back to earlier authors, for example: * Francis Galton emphasized
order statistic In statistics, the ''k''th order statistic of a statistical sample is equal to its ''k''th-smallest value. Together with rank statistics, order statistics are among the most fundamental tools in non-parametric statistics and inference. Importan ...
s and
quantile In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile th ...
s. *
Arthur Lyon Bowley Sir Arthur Lyon Bowley, FBA (6 November 1869 – 21 January 1957) was an English statistician and economist who worked on economic statistics and pioneered the use of sampling techniques in social surveys. Early life Bowley's father, James Wil ...
used precursors of the stemplot and
five-number summary The five-number summary is a set of descriptive statistics that provides information about a dataset. It consists of the five most important sample percentiles: # the sample minimum ''(smallest observation)'' # the lower quartile or ''first quart ...
(Bowley actually used a " seven-figure summary", including the extremes,
decile In descriptive statistics, a decile is any of the nine values that divide the sorted data into ten equal parts, so that each part represents 1/10 of the sample or population. A decile is one possible form of a quantile; others include the quartile ...
s and quartiles, along with the median—see his ''Elementary Manual of Statistics'' (3rd edn., 1920), p. 62– he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions"). * Andrew Ehrenberg articulated a philosophy of
data reduction Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The purpose of data reduction can be two-fold: reduce the number of data rec ...
(see his book of the same name). The
Open University The Open University (OU) is a British public research university and the largest university in the United Kingdom by number of students. The majority of the OU's undergraduate students are based in the United Kingdom and principally study off- ...
course ''Statistics in Society'' (MDST 242), took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the
median test In statistics, Mood's median test is a special case of Pearson's chi-squared test. It is a nonparametric test that tests the null hypothesis that the medians of the populations from which two or more samples are drawn are identical. The data in e ...
.


Example

Findings from EDA are orthogonal to the primary analysis task. To illustrate, consider an example from Cook et al. where the analysis task is to find the variables which best predict the tip that a dining party will give to the waiter. Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007) ″Interactive and Dynamic Graphics for Data Analysis: With R and GGobi″ Springer, 978-0387717616 The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The primary analysis task is approached by fitting a regression model where the tip rate is the response variable. The fitted model is : ( tip rate) = 0.18 - 0.01 × (party size) which says that as the size of the dining party increases by one person (leading to a higher bill), the tip rate will decrease by 1%, on average. However, exploring the data reveals other interesting features not described by this model. Tips-hist1.png, Histogram of tip amounts where the bins cover $1 increments. The distribution of values is skewed right and unimodal, as is common in distributions of small, non-negative quantities. Tips-hist2.png, Histogram of tip amounts where the bins cover $0.10 increments. An interesting phenomenon is visible: peaks occur at the whole-dollar and half-dollar amounts, which is caused by customers picking round numbers as tips. This behavior is common to other types of purchases too, like gasoline. Tips-scat1.png, Scatterplot of tips vs. bill. Points below the line correspond to tips that are lower than expected (for that bill amount), and points above the line are higher than expected. We might expect to see a tight, positive linear association, but instead see variation that increases with tip amount. In particular, there are more points far away from the line in the lower right than in the upper left, indicating that more customers are very cheap than very generous. Tips-scat2.png, Scatterplot of tips vs. bill separated by payer gender and smoking section status. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and the female non-smokers tend to be very consistent tippers (with three conspicuous exceptions shown in the sample). What is learned from the plots is different from what is illustrated by the regression model, even though the experiment was not designed to investigate any of these other trends. The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data.


Software

* JMP, an EDA package from
SAS Institute SAS Institute (or SAS, pronounced "sass") is an American multinational developer of analytics software based in Cary, North Carolina. SAS develops and markets a suite of analytics software ( also called SAS), which helps access, manage, ana ...
. *
KNIME KNIME (), the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks ...
, Konstanz Information Miner – Open-Source data exploration platform based on Eclipse. *
Minitab Minitab is a statistics package developed at the Pennsylvania State University by researchers Barbara F. Ryan, Thomas A. Ryan, Jr., and Brian L. Joiner in conjunction with Triola Statistics Company in 1972. It began as a light version of OMNITA ...
, an EDA and general statistics package widely used in industrial and corporate settings. *
Orange Orange most often refers to: *Orange (fruit), the fruit of the tree species '' Citrus'' × ''sinensis'' ** Orange blossom, its fragrant flower *Orange (colour), from the color of an orange, occurs between red and yellow in the visible spectrum * ...
, an open-source data mining and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
software suite. *
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
, an open-source programming language widely used in data mining and machine learning. * R, an open-source programming language for statistical computing and graphics. Together with Python one of the most popular languages for data science. * TinkerPlots an EDA software for upper elementary and middle school students. *
Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus '' Gallirallus''. Four subspecies are recogni ...
an open source data mining package that includes visualization and EDA tools such as
targeted projection pursuit Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...
.


See also

* Anscombe's quartet, on importance of exploration * Data dredging *
Predictive analytics Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modeling, and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events. In busine ...
*
Structured data analysis (statistics) Structured data analysis is the statistical data analysis of structured data. This can arise either in the form of an ''a priori'' structure such as multiple-choice questionnaires or in situations with the need to search for structure that fits t ...
* Configural frequency analysis *
Descriptive statistics A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the process of using and an ...


References


Bibliography

*Andrienko, N & Andrienko, G (2005) ''Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach''. Springer. *Andrienko, N & Andrienko, G (2005) Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach. Springer. ISBN 3-540-25994-5 Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007-12-12). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 9780387717616. Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 978-0-471-09776-1. Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 978-0-471-09777-8. Young, F. W. Valero-Mora, P. and Friendly M. (2006) Visual Statistics: Seeing your data with Dynamic Interactive Graphics. Wiley ISBN 978-0-471-68160-1 Jambu M. (1991) Exploratory and Multivariate Data Analysis. Academic Press ISBN 0123800900 S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986) Graphical Exploratory Data Analysis. Springer ISBN 978-1-4612-9371-2 * * * *Leinhardt, G., Leinhardt, S.,
Exploratory Data Analysis: New Tools for the Analysis of Empirical Data
', Review of Research in Education, Vol. 8, 1980 (1980), pp. 85–157. * *Theus, M., Urbanek, S. (2008), Interactive Graphics for Data Analysis: Principles and Examples, CRC Press, Boca Raton, FL, * * * * Young, F. W. Valero-Mora, P. and Friendly M. (2006
''Visual Statistics: Seeing your data with Dynamic Interactive Graphics''
Wiley *Jambu M. (1991
''Exploratory and Multivariate Data Analysis''
Academic Press *S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986
''Graphical Exploratory Data Analysis''
Springer Andrienko, N & Andrienko, G (2005) Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach. Springer. ISBN 3-540-25994-5 Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007-12-12). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 9780387717616. Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 978-0-471-09776-1. Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 978-0-471-09777-8. Young, F. W. Valero-Mora, P. and Friendly M. (2006) Visual Statistics: Seeing your data with Dynamic Interactive Graphics. Wiley ISBN 978-0-471-68160-1 Jambu M. (1991) Exploratory and Multivariate Data Analysis. Academic Press ISBN 0123800900 S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986) Graphical Exploratory Data Analysis. Springer ISBN 978-1-4612-9371-2


External links


Carnegie Mellon University – free online course on Probability and Statistics, with a module on EDA


{{Authority control