Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration, rather than traditional data management systems, to understand what is in a dataset and the characteristics of the data. (FOSTER Open Science, Overview of Data Exploration Techniques: Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri.)
These characteristics can include the size or amount of data, the completeness of the data, the correctness of the data, and possible relationships amongst data elements or files/tables in the data. Data exploration is typically conducted using a combination of automated and manual activities. (Stanford.edu, 2011: Wrangler: Interactive Visual Specification of Data Transformation Scripts, Kandel, Paepcke, Hellerstein, Heer.)
Automated activities can include data profiling, data visualization or tabular reports, which give the analyst an initial view into the data and an understanding of key characteristics. This is often followed by manual drill-down or filtering of the data to investigate anomalies or patterns surfaced by the automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or the use of spreadsheets or similar tools to view the raw data. (Stanford.edu: Enterprise Data Analysis and Visualization: An Interview Study, Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer, Proc. IEEE Visual Analytics Science & Technology (VAST), Oct 2012.)
All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and at defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding has been reached, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining data quality. Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data; it does not require formulating assumptions beforehand. Traditionally, this was a key area of focus for statisticians, with John Tukey being a key evangelist in the field. (Exploratory Data Analysis, Pearson.) Today, data exploration is more widespread and is the focus of data analysts and data scientists, the latter being a relatively new role within enterprises and larger organizations.
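
The automated profiling and pruning steps described above can be illustrated with a minimal sketch. It uses only the Python standard library; the sample data, the column names and the helper functions `to_float`, `profile` and `prune` are all hypothetical and invented for this illustration, not taken from any particular tool. It computes a few basic metadata items per column (completeness, distinct values, values that fail to parse as numbers, mean) and then drops rows whose numeric fields are unusable.

```python
import csv
import io
import statistics

# Hypothetical sample data; the columns and values are invented for illustration.
SAMPLE_CSV = """id,age,income,city
1,34,52000,Boston
2,,61000,Berlin
3,29,48000,
4,41,75000,London
5,abc,58000,Boston
"""


def to_float(value):
    """Return value as a float, or None if it is empty or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def profile(rows, columns):
    """Compute basic per-column metadata: completeness, distinct count, mean."""
    report = {}
    for col in columns:
        values = [row[col].strip() for row in rows]
        present = [v for v in values if v]
        numeric = [x for x in map(to_float, present) if x is not None]
        report[col] = {
            "completeness": len(present) / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
            "non_numeric": len(present) - len(numeric),
            "mean": statistics.mean(numeric) if numeric else None,
        }
    return report


def prune(rows, numeric_columns):
    """Keep only rows whose numeric columns all parse as numbers."""
    return [
        row for row in rows
        if all(to_float(row[col].strip()) is not None for col in numeric_columns)
    ]


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
columns = reader.fieldnames

report = profile(rows, columns)
clean_rows = prune(rows, ["age", "income"])

print("rows:", len(rows), "columns:", len(columns))
for col, stats in report.items():
    print(col, stats)
print("rows kept after pruning:", len(clean_rows))
```

In this sketch the report would prompt a manual drill-down into the `age` column, which has both a missing value and a value that is not a number. Whether such rows should be dropped, as `prune` does here, or corrected is a judgment that depends on the data set.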


Interactive Data Exploration

This area of data exploration has become an area of interest in the field of machine learning. The area is relatively new and still evolving. At its most basic level, a machine-learning algorithm can be fed a data set and used to test whether a hypothesis holds for that data set. Common machine learning algorithms focus on identifying specific patterns in the data. Common patterns include regression, classification and clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning. By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques. (Machine Learning for Data Exploration.)
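
As one illustration of pattern finding by clustering, the sketch below groups points with a basic k-means loop. K-means is chosen here only as a familiar example of a clustering algorithm; the text above does not name a specific method. The data points, the function name `kmeans` and its parameters are assumptions made for this illustration, and the code uses only the Python standard library.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters with a basic k-means loop."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Hypothetical data: two loose groups of (x, y) measurements.
points = [
    (1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (0.9, 1.4),
    (8.0, 8.2), (7.7, 8.5), (8.4, 7.9), (8.1, 8.3),
]

centroids, assignments = kmeans(points, k=2)
print("centroids:", centroids)
print("assignments:", assignments)
```

With well-separated groups like these, the first four points end up in one cluster and the last four in the other. On real data the result depends on the choice of `k` and on the starting centroids, so the output is better treated as a prompt for further inspection than as a conclusion.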


Software

* Trifacta - a data preparation and analysis platform
* Paxata - self-service data preparation software
* Alteryx - data blending and advanced data analytics software
* Microsoft Power BI - interactive visualization and data analysis tool
* OpenRefine - a standalone open source desktop application for data clean-up and data transformation
* Tableau software - interactive data visualization software


See also

* Exploratory data analysis
* Machine learning
* Data profiling
* Data visualization

