
Computational statistics, or statistical computing, is the interface between statistics and computer science. It refers to statistical methods that are enabled by computational methods. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics. This area is also developing rapidly, leading to calls that a broader concept of computing should be taught as part of general statistical education.
As in traditional statistics, the goal is to transform raw data into knowledge,[Wegman, Edward J. "Computational Statistics: A New Agenda for Statistical Theory and Practice." ''Journal of the Washington Academy of Sciences'', vol. 78, no. 4, 1988, pp. 310–322. ''JSTOR''.] but the focus lies on computer-intensive statistical methods, such as cases with very large sample sizes and non-homogeneous data sets.
The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics", and 'computational statistics' as "aiming at the design of algorithm for implementing statistical methods on computers, including the ones unthinkable before the computer age (e.g. bootstrap, simulation), as well as to cope with analytically intractable problems" [''sic''].
The term 'computational statistics' may also be used to refer to computationally ''intensive'' statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.
History
Though computational statistics is widely used today, it actually has a relatively short history of acceptance in the
statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology.
In the statistical field, the first use of the term “computer” appears in an 1891 article by Robert P. Porter in the ''Journal of the American Statistical Association'' archives. The article discusses the use of Herman Hollerith’s machine in the 11th Census of the United States. Hollerith’s machine, also called the tabulating machine, was an electromechanical machine designed to assist in summarizing information stored on punched cards. It was invented by Herman Hollerith (February 29, 1860 – November 17, 1929), an American businessman, inventor, and statistician. His punched card tabulating machine was patented in 1884 and was later used in the 1890 Census of the United States. The advantages of the technology were immediately apparent: the 1880 Census, with about 50 million people, took over 7 years to tabulate, while the 1890 Census, with over 62 million people, took less than a year. This marks the beginning of the era of mechanized computational statistics and semiautomatic data processing systems.
In 1908, William Sealy Gosset performed his now well-known Monte Carlo simulation, which led to the discovery of Student’s t-distribution. With the help of computational methods, he also produced plots of the empirical distributions overlaid on the corresponding theoretical distributions. The computer has since revolutionized simulation and has made the replication of Gosset’s experiment little more than an exercise.
Later on, scientists put forward computational ways of generating pseudo-random deviates, devised methods to convert uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance-rejection methods, and developed state-space methodology for Markov chain Monte Carlo.
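The two conversion techniques named above can be sketched in a few lines. This is only an illustrative sketch, not a historical reconstruction: the function names, the choice of an exponential target for the inverse-CDF method, and the choice of a half-normal target with an Exp(1) proposal for acceptance-rejection are all assumptions made for the example.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse-CDF method: if U ~ Uniform(0, 1), then -ln(1 - U) / rate ~ Exp(rate)."""
    u = rng.random()  # uniform deviate in [0, 1)
    return -math.log(1.0 - u) / rate  # 1 - u lies in (0, 1], so the log is defined


def sample_half_normal(rng):
    """Acceptance-rejection: half-normal target, Exp(1) proposal (assumed for illustration)."""
    while True:
        x = rng.expovariate(1.0)  # draw a candidate from the proposal
        # The ratio target / (M * proposal) simplifies to exp(-(x - 1)^2 / 2).
        if rng.random() <= math.exp(-0.5 * (x - 1.0) ** 2):
            return x


rng = random.Random(0)  # seeded generator so runs are repeatable
exponential_draws = [sample_exponential(2.0, rng) for _ in range(5)]
half_normal_draws = [sample_half_normal(rng) for _ in range(5)]
```

Both functions consume only uniform deviates (directly, or through the proposal), which is the point of the techniques: any reliable uniform generator can be turned into a generator for other distributions.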
By the mid-1950s, a lot of work was being done on testing the generators for randomness. By then, most computers could refer to random number tables. In 1958, John Tukey’s jackknife was developed. It is a method for reducing the bias of parameter estimates in samples under nonstandard conditions, and it requires computers for practical implementation. To this point, computers have made many tedious statistical studies feasible.
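As a rough illustration of the jackknife idea mentioned above, the sketch below applies the standard leave-one-out bias correction to an arbitrary estimator. The function names and the sample values are hypothetical; the plug-in variance is used only because its bias is well understood.

```python
def jackknife_bias_corrected(data, estimator):
    """Return the jackknife bias-corrected value of estimator(data)."""
    n = len(data)
    theta_hat = estimator(data)
    # Recompute the estimate n times, each time leaving one observation out.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_dot = sum(leave_one_out) / n
    bias_estimate = (n - 1) * (theta_dot - theta_hat)
    return theta_hat - bias_estimate


def plugin_variance(xs):
    """Biased variance estimate that divides by n rather than n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical data
corrected = jackknife_bias_corrected(sample, plugin_variance)
```

For this particular estimator the correction recovers the usual unbiased sample variance (the one that divides by n - 1). The n repeated evaluations of the estimator are what make the method impractical by hand for larger samples.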
Methods
Maximum likelihood estimation
Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. It is achieved by maximizing a likelihood function so that the observed data are most probable under the assumed statistical model.
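A minimal sketch of the idea, assuming an exponential model and a small made-up sample: the log-likelihood is maximized numerically with a golden-section search and compared with the known closed-form estimate (the reciprocal of the sample mean). The function names, the search bounds, and the data are assumptions for illustration; practical work would normally rely on an established optimization library.

```python
import math


def exp_log_likelihood(rate, data):
    """Log-likelihood of an exponential model with the given rate."""
    return len(data) * math.log(rate) - rate * sum(data)


def golden_section_maximize(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0


data = [0.5, 1.2, 0.3, 2.0, 0.8]  # hypothetical observations
rate_hat = golden_section_maximize(lambda r: exp_log_likelihood(r, data), 1e-6, 50.0)
closed_form = len(data) / sum(data)  # analytic MLE for the exponential rate
```

The numerical and analytic answers should agree to within the search tolerance. The numerical route matters for models where no closed form is available.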
Monte Carlo method
The Monte Carlo method is a statistical method that relies on repeated random sampling to obtain numerical results. The concept is to use randomness to solve problems that might be deterministic in principle. Monte Carlo methods are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. They are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
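As an illustration of the numerical integration use case, the following sketch estimates a definite integral by averaging the integrand at uniformly drawn points. The function name `mc_integrate`, the integrand, and the sample size are assumptions chosen for the example.

```python
import math
import random


def mc_integrate(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform random points."""
    total = sum(f(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n


rng = random.Random(1)  # seeded so the run is repeatable
estimate = mc_integrate(lambda x: math.exp(-x * x), 0.0, 1.0, 100_000, rng)
exact = math.sqrt(math.pi) / 2.0 * math.erf(1.0)  # reference value for comparison
```

The error of such an estimate shrinks roughly in proportion to one over the square root of the number of points, regardless of dimension, which is why the approach is attractive when deterministic quadrature becomes impractical.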
Markov chain Monte Carlo
The Markov chain Monte Carlo method creates samples from a continuous random variable with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, such as its expected value or variance. The more steps are included, the more closely the distribution of the sample matches the actual desired distribution.
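One common member of this family is the random-walk Metropolis algorithm, sketched below under stated assumptions: the target is known only up to a constant through its log density, the proposal is a Gaussian step, and the function name, step size, starting point, and burn-in length are all illustrative choices rather than recommendations.

```python
import math
import random


def metropolis(log_target, x0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler for an unnormalized log density."""
    samples = []
    x = x0
    log_p = log_target(x)
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p_new / p_old); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def log_density(x):
    """Unnormalized log density of a normal distribution with mean 3 and variance 1."""
    return -0.5 * (x - 3.0) ** 2


rng = random.Random(42)
chain = metropolis(log_density, 0.0, 50_000, 1.0, rng)
kept = chain[1000:]  # discard early steps taken before the chain settles
mean_estimate = sum(kept) / len(kept)
variance_estimate = sum((x - mean_estimate) ** 2 for x in kept) / len(kept)
```

The sample mean and variance of the retained draws approximate the expected value and variance of the target, and the approximation generally improves as the chain is run longer.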
Applications
* Computational biology
* Computational linguistics
* Computational physics
* Computational mathematics
* Computational materials science
Computational statistics journals
* ''Communications in Statistics - Simulation and Computation''
* ''Computational Statistics''
* ''Computational Statistics & Data Analysis''
* ''Journal of Computational and Graphical Statistics''
* ''Journal of Statistical Computation and Simulation''
* ''Journal of Statistical Software''
* ''The R Journal''
* ''The Stata Journal''
* ''Statistics and Computing''
* ''Wiley Interdisciplinary Reviews: Computational Statistics''
Associations
* International Association for Statistical Computing
See also
* Algorithms for statistical classification
* Data science
* Statistical methods in artificial intelligence
* Free statistical software
* List of statistical algorithms
* List of statistical packages
* Machine learning
Further reading
Books
* Gharieb, Reda R. (2017). ''Data Science: Scientific and Statistical Computing''. Noor Publishing. ISBN 978-3-330-97256-8.
External links
Associations
* International Association for Statistical Computing
* Statistical Computing section of the American Statistical Association
Journals
* Computational Statistics & Data Analysis
* Journal of Computational & Graphical Statistics
* Statistics and Computing