
Computational statistics, or statistical computing, is the bond between statistics and computer science. It refers to statistical methods that are enabled by using computational methods, and it is the area of computational science (or scientific computing) specific to the mathematical science of statistics. This area is developing rapidly, leading to calls that a broader concept of computing should be taught as part of general statistical education.
As in traditional statistics, the goal is to transform raw data into knowledge (Wegman, Edward J. "Computational Statistics: A New Agenda for Statistical Theory and Practice." ''Journal of the Washington Academy of Sciences'', vol. 78, no. 4, 1988, pp. 310–322), but the focus lies on computer-intensive statistical methods, such as cases with very large sample sizes and non-homogeneous data sets.
The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the
International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics",
and 'computational statistics' as "aiming at the design of algorithm for implementing
statistical methods on computers, including the ones unthinkable before the computer
age (e.g.
bootstrap,
simulation), as well as to cope with analytically intractable problems" [''sic''].
The term 'computational statistics' may also be used to refer to computationally ''intensive'' statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.
History
Though computational statistics is widely used today, it actually has a relatively short history of acceptance in the
statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology.
In the field of statistics, the first use of the term “computer” comes in an article in the ''Journal of the American Statistical Association'' archives by Robert P. Porter in 1891. The article discusses the use of Herman Hollerith’s machine in the 11th Census of the United States. Hollerith’s machine, also called a
tabulating machine
, was an
electromechanical machine designed to assist in summarizing information stored on
punched cards. It was invented by Herman Hollerith (February 29, 1860 – November 17, 1929), an American businessman, inventor, and statistician. His invention of the punched card tabulating machine was patented in 1884, and later was used in the 1890
Census of the United States. The advantages of the technology were immediately apparent: the 1880 Census, with about 50 million people, took over seven years to tabulate, while the 1890 Census, with over 62 million people, took less than a year. This marks the beginning of the era of mechanized computational statistics and semiautomatic
data processing
systems.
In 1908,
William Sealy Gosset performed his now well-known
Monte Carlo method simulation which led to the discovery of the
Student’s t-distribution
. With the help of computational methods, he was also able to plot the empirical distributions overlaid on the corresponding theoretical distributions. The computer has revolutionized simulation and has made the replication of Gosset’s experiment little more than an exercise.
Later on, scientists put forward computational ways of generating pseudo-random deviates, developed methods to convert uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance-rejection methods, and developed state-space methodology for Markov chain Monte Carlo.
By the mid-1950s, a great deal of work was being done on testing generators for randomness. Most computers could now draw on random number tables. In 1958,
John Tukey
’s jackknife was developed as a method to reduce the bias of parameter estimates in samples under nonstandard conditions. It requires computers for practical implementation. By this point, computers had made many tedious statistical studies feasible.
Methods
Maximum likelihood estimation
Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. It is achieved by maximizing a likelihood function so that the observed data are most probable under the assumed statistical model.
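As an illustration, the following is a minimal sketch in Python (assuming NumPy and SciPy are available) of fitting a normal distribution by numerically minimizing the negative log-likelihood; the synthetic data, starting values, and log-sigma parameterization are illustrative assumptions rather than part of any standard reference.

```python
# A minimal sketch of numerical maximum likelihood estimation for a normal
# model. The synthetic data, starting values, and log-sigma parameterization
# are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic observations

def neg_log_likelihood(params, x):
    # Negative log-likelihood of a normal model; minimizing it is
    # equivalent to maximizing the likelihood function.
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE estimates: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```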
Monte Carlo method
A Monte Carlo method is a statistical method that relies on repeated random sampling to obtain numerical results. The concept is to use randomness to solve problems that might be deterministic in principle. Monte Carlo methods are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. They are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
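A minimal sketch of Monte Carlo numerical integration in Python with NumPy; the integrand exp(-x^2) on [0, 1] and the sample size are illustrative choices:

```python
# A minimal sketch of Monte Carlo numerical integration: estimate the
# integral of exp(-x^2) over [0, 1] by averaging the integrand at
# uniformly sampled points. The integrand and sample size are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100_000

x = rng.uniform(0.0, 1.0, size=n_samples)        # repeated random sampling
values = np.exp(-x**2)
estimate = values.mean()                          # interval length is 1, so the mean suffices
std_error = values.std(ddof=1) / np.sqrt(n_samples)

print(f"Monte Carlo estimate: {estimate:.5f} +/- {std_error:.5f}")
# The exact value, (sqrt(pi)/2) * erf(1), is approximately 0.74682.
```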
Markov chain Monte Carlo
The Markov chain Monte Carlo method creates samples from a continuous random variable, with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, such as its expected value or variance. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
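A minimal sketch of one common MCMC algorithm, a random-walk Metropolis sampler, written in Python with NumPy; the target density (a standard normal known only up to a constant), the proposal scale, and the chain length are illustrative assumptions:

```python
# A minimal sketch of a random-walk Metropolis sampler, one common MCMC
# algorithm. The target density, proposal scale, and chain length are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)

def unnormalized_density(x):
    # Proportional to a standard normal density; the normalizing
    # constant is never needed by the sampler.
    return np.exp(-0.5 * x**2)

n_steps = 50_000
samples = np.empty(n_steps)
x_current = 0.0

for i in range(n_steps):
    x_proposal = x_current + rng.normal(scale=1.0)             # random-walk proposal
    accept_prob = min(1.0, unnormalized_density(x_proposal)
                           / unnormalized_density(x_current))  # Metropolis acceptance ratio
    if rng.uniform() < accept_prob:
        x_current = x_proposal
    samples[i] = x_current

# The samples approximate expectations under the target distribution,
# e.g. its mean and variance; accuracy improves as more steps are included.
print(f"estimated mean = {samples.mean():.3f}, variance = {samples.var():.3f}")
```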
Applications
* Computational biology
* Computational linguistics
* Computational physics
* Computational mathematics
* Computational materials science
Computational statistics journals
*''Communications in Statistics - Simulation and Computation''
*''Computational Statistics''
*''Computational Statistics & Data Analysis''
*''Journal of Computational and Graphical Statistics''
*''Journal of Statistical Computation and Simulation''
*''Journal of Statistical Software''
*''The R Journal''
*''The Stata Journal''
*''Statistics and Computing''
*''Wiley Interdisciplinary Reviews Computational Statistics''
Associations
* International Association for Statistical Computing
See also
* Algorithms for statistical classification
* Data science
* Statistical methods in artificial intelligence
* Free statistical software
* List of statistical algorithms
* List of statistical packages
* Machine learning
References
Further reading
Books
* Gharieb, Reda R. (2017). ''Data Science: Scientific and Statistical Computing''. Noor Publishing. ISBN 978-3-330-97256-8.
External links
Associations
* International Association for Statistical Computing
* Statistical Computing section of the American Statistical Association
Journals
*''Computational Statistics & Data Analysis''
*''Journal of Computational & Graphical Statistics''
*''Statistics and Computing''