
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains. Data science is related to data mining, machine learning, big data, computational statistics and analytics.
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyse actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science.
Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
A data scientist is someone who writes code and combines it with statistical knowledge to derive insights from data.
Foundations
Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business. Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.
Relationship to statistics
Many statisticians, including Nate Silver, have argued that data science is not a new field, but rather another name for statistics. Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data. Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g., from images, text, sensors, transactions, or customer information) and emphasizes prediction and action. Andrew Gelman of Columbia University has described statistics as a nonessential part of data science.
Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. He describes data science as an applied field growing out of traditional statistics.
Etymology
Early usage
In 1962, John Tukey described a field he called "data analysis", which resembles modern data science.
In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing.
The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science.
In 1996, a conference of the International Federation of Classification Societies became the first to specifically feature data science as a topic.
However, the definition was still in flux. Following his 1985 lecture at the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu again suggested in 1997 that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.
During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly large) included "knowledge discovery" and "data mining".
Modern usage
In 2012, technologists
Thomas H. Davenport and