HOME

TheInfoList



OR:

Data science is an
interdisciplinary Interdisciplinarity or interdisciplinary studies involves the combination of multiple academic disciplines into one activity (e.g., a research project). It draws knowledge from several other fields like sociology, anthropology, psychology, ec ...
field that uses
scientific method The scientific method is an empirical method for acquiring knowledge that has characterized the development of science since at least the 17th century (with notable practitioners in previous centuries; see the article history of scientifi ...
s, processes,
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
s and systems to extract or extrapolate
knowledge Knowledge can be defined as awareness of facts or as practical skills, and may also refer to familiarity with objects or situations. Knowledge of facts, also called propositional knowledge, is often defined as true belief that is distin ...
and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains. Data science is related to data mining,
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
,
big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
, computational statistics and
analytics Analytics is the systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and communication of meaningful patterns in data. It also entails applying data patterns toward effective decision-making. It ...
. Data science is a "concept to unify
statistics Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, indust ...
,
data analysis Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enc ...
,
informatics Informatics is the study of computational systems, especially those for data storage and retrieval. According to ACM ''Europe and'' '' Informatics Europe'', informatics is synonymous with computer science and computing as a profession, in which t ...
, and their related methods" in order to "understand and analyse actual
phenomena A phenomenon ( : phenomena) is an observable event. The term came into its modern philosophical usage through Immanuel Kant, who contrasted it with the noumenon, which ''cannot'' be directly observed. Kant was heavily influenced by Gottfried ...
" with
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
. It uses techniques and theories drawn from many fields within the context of
mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in modern mathematics ...
, statistics,
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
,
information science Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information. ...
, and domain knowledge. However, data science is different from
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
and information science.
Turing Award The ACM A. M. Turing Award is an annual prize given by the Association for Computing Machinery (ACM) for contributions of lasting and major technical importance to computer science. It is generally recognized as the highest distinction in compu ...
winner Jim Gray imagined data science as a "fourth paradigm" of science (
empirical Empirical evidence for a proposition is evidence, i.e. what supports or counters this proposition, that is constituted by or accessible to sense experience or experimental procedure. Empirical evidence is of central importance to the sciences and ...
, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of
information technology Information technology (IT) is the use of computers to create, process, store, retrieve, and exchange all kinds of data . and information. IT forms part of information and communications technology (ICT). An information technology syste ...
" and the data deluge. A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.


Foundations

Data science is an
interdisciplinary Interdisciplinarity or interdisciplinary studies involves the combination of multiple academic disciplines into one activity (e.g., a research project). It draws knowledge from several other fields like sociology, anthropology, psychology, ec ...
field focused on extracting knowledge from typically
large Large means of great size. Large may also refer to: Mathematics * Arbitrarily large, a phrase in mathematics * Large cardinal, a property of certain transfinite numbers * Large category, a category with a proper class of objects and morphisms ...
data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics,
data visualization Data and information visualization (data viz or info viz) is an interdisciplinary field that deals with the graphic representation of data and information. It is a particularly efficient way of communicating when the data or information is nu ...
, information visualization,
data sonification Data sonification is the presentation of data as sound using sonification. It is the auditory equivalent of the more established practice of data visualization. The usual process for data sonification is directing digital media of a dataset through ...
, data integration,
graphic design Graphic design is a profession, academic discipline and applied art whose activity consists in projecting visual communications intended to transmit specific messages to social groups, with specific objectives. Graphic design is an interdiscip ...
,
complex systems A complex system is a system composed of many components which may interact with each other. Examples of complex systems are Earth's global climate, organisms, the human brain, infrastructure such as power grid, transportation or communication sy ...
,
communication Communication (from la, communicare, meaning "to share" or "to be in relation with") is usually defined as the transmission of information. The term may also refer to the message communicated through such transmissions or the field of inqui ...
and
business Business is the practice of making one's living or making money by producing or buying and selling products (such as goods and services). It is also "any activity or enterprise entered into for profit." Having a business name does not separ ...
. Statistician
Nathan Yau Nathan Yau is an American statistician and data visualization expert. Early life Nathan Chun-Yin Yau grew up in Fresno, California. He received a Bachelor of Science in electrical engineering and computer science from the University of Calif ...
, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and
explore Exploration refers to the historical practice of discovering remote lands. It is studied by geographers and historians. Two major eras of exploration occurred in human history: one of convergence, and one of divergence. The first, covering most ...
data. In 2015, the American Statistical Association identified
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases ...
management, statistics and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
, and distributed and parallel systems as the three emerging foundational professional communities.


Relationship to statistics

Many statisticians, including
Nate Silver Nathaniel Read Silver (born January 13, 1978) is an American statistician, writer, and poker player who analyzes baseball (see sabermetrics), basketball, and elections (see psephology). He is the founder and editor-in-chief of '' FiveThirtyE ...
, have argued that data science is not a new field, but rather another name for statistics. Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data. Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g. from images, text, sensors, transactions or customer information, etc) and emphasizes prediction and action.
Andrew Gelman Andrew Eric Gelman (born February 11, 1965) is an American statistician and professor of statistics and political science at Columbia University. Gelman received bachelor of science degrees in mathematics and in physics from MIT, where he w ...
of
Columbia University Columbia University (also known as Columbia, and officially as Columbia University in the City of New York) is a private research university in New York City. Established in 1754 as King's College on the grounds of Trinity Church in Manhatt ...
has described statistics as a nonessential part of data science. Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. He describes data science as an applied field growing out of traditional statistics.


Etymology


Early usage

In 1962, John Tukey described a field he called "data analysis", which resembles modern data science. In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing. The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis. During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly large) included "knowledge discovery" and " data mining".


Modern usage

In 2012, technologists
Thomas H. Davenport Thomas Hayes "Tom" Davenport, Jr. (born October 17, 1954) is an American academic and author specializing in analytics, business process innovation, knowledge management, and artificial intelligence. He is currently the President’s Distinguish ...
and DJ Patil declared "Data Scientist: The Sexiest Job of the 21st Century", a catch-phrase that was picked up even by major-city newspapers like the
New York Times ''The New York Times'' (''the Times'', ''NYT'', or the Gray Lady) is a daily newspaper based in New York City with a worldwide readership reported in 2020 to comprise a declining 840,000 paid print subscribers, and a growing 6 million paid ...
and the
Boston Globe ''The Boston Globe'' is an American daily newspaper founded and based in Boston, Massachusetts. The newspaper has won a total of 27 Pulitzer Prizes, and has a total circulation of close to 300,000 print and digital subscribers. ''The Boston Gl ...
. A decade later, they reaffirmed it, stating "the job is more in demand than ever with employers". The modern conception of data science as an independent discipline is sometimes attributed to
William S. Cleveland William Swain Cleveland II (born 1943) is an American computer scientist and Professor of Statistics and Professor of Computer Science at Purdue University, known for his work on data visualization, particularly on nonparametric regression and l ...
. In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name. "Data science" became more widely used in the next few years: in 2002, the
Committee on Data for Science and Technology The Committee on Data of the International Science Council (CODATA) was established in 1966 as the Committee on Data for Science and Technology, originally part of the International Council of Scientific Unions, now part of the International ...
launched ''Data Science Journal''. In 2003, Columbia University launched ''The Journal of Data Science''. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science. The professional title of "data scientist" has been attributed to DJ Patil and
Jeff Hammerbacher Jeff Hammerbacher is a data scientist. He was chief scientist and cofounder at Cloudera and later served on the faculty of the Icahn School of Medicine at Mount Sinai. Early life Hammerbacher grew up in Fort Wayne, Indiana. His father worked at the ...
in 2008. Though it was used by the National Science Board in their 2005 report "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", it referred broadly to any key role in managing a digital data collection. There is still no consensus on the definition of data science, and it is considered by some to be a
buzzword A buzzword is a word or phrase, new or already existing, that becomes popular for a period of time. Buzzwords often derive from technical terms yet often have much of the original technical meaning removed through fashionable use, being simply used ...
.
Big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
is a related marketing term. Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations.


See also

* Open Data Science Conference * Scientific Data * Women in Data


References

{{Data Information science Computer occupations Computational fields of study Data analysis