
Computational statistics, or statistical computing, is the bond between statistics and computer science. It refers to statistical methods that are enabled by using computational methods, and it is the area of computational science (or scientific computing) specific to the mathematical science of statistics. This area is developing rapidly, leading to calls that a broader concept of computing should be taught as part of general statistical education.
As in traditional statistics, the goal is to transform raw data into knowledge (Wegman, Edward J. "Computational Statistics: A New Agenda for Statistical Theory and Practice." ''Journal of the Washington Academy of Sciences'', vol. 78, no. 4, 1988, pp. 310–322), but the focus lies on computer-intensive statistical methods, such as cases with very large sample sizes and non-homogeneous data sets.
The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the
International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics",
and 'computational statistics' as "aiming at the design of algorithm for implementing
statistical methods on computers, including the ones unthinkable before the computer
age (e.g.
bootstrap,
simulation), as well as to cope with analytically intractable problems" [''sic''].
The term 'computational statistics' may also be used to refer to computationally ''intensive'' statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.
History
Though computational statistics is widely used today, it actually has a relatively short history of acceptance in the
statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology.
In the field of statistics, the first use of the term “computer” comes in an article in the ''Journal of the American Statistical Association'' archives by Robert P. Porter in 1891. The article discusses the use of Herman Hollerith’s machine in the 11th Census of the United States. Hollerith’s machine, also called a
tabulating machine
, was an
electromechanical machine designed to assist in summarizing information stored on
punched cards. It was invented by Herman Hollerith (February 29, 1860 – November 17, 1929), an American businessman, inventor, and statistician. His invention of the punched card tabulating machine was patented in 1884, and later was used in the 1890
Census of the United States. The advantages of the technology were immediately apparent: the 1880 Census, with about 50 million people, took over seven years to tabulate, while the 1890 Census, with over 62 million people, took less than a year. This marks the beginning of the era of mechanized computational statistics and semiautomatic
data processing
systems.
In 1908,
William Sealy Gosset performed his now well-known
Monte Carlo method simulation which led to the discovery of the
Student’s t-distribution
. With the help of computational methods, he was also able to plot the empirical distributions overlaid on the corresponding theoretical distributions. The computer has revolutionized simulation and has made the replication of Gosset’s experiment little more than an exercise.
Later on, scientists put forward computational ways of generating pseudo-random deviates, developed methods to convert uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance-rejection methods, and developed state-space methodology for Markov chain Monte Carlo.
By the mid-1950s, a great deal of work was being done on testing generators for randomness. Most computers could now draw on random number tables. In 1958,
John Tukey
’s jackknife was developed as a method to reduce the bias of parameter estimates in samples under nonstandard conditions. It requires computers for practical implementation. By this point, computers had made many tedious statistical studies feasible.
Methods
Maximum likelihood estimation
Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. It is achieved by maximizing a likelihood function so that the observed data are most probable under the assumed statistical model.
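As an illustration, the following is a minimal sketch in Python (assuming NumPy and SciPy are available) of fitting a normal distribution by numerically minimizing the negative log-likelihood; the synthetic data, starting values, and log-sigma parameterization are illustrative assumptions rather than part of any standard reference.

```python
# A minimal sketch of numerical maximum likelihood estimation for a normal
# model. The synthetic data, starting values, and log-sigma parameterization
# are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic observations

def neg_log_likelihood(params, x):
    # Negative log-likelihood of a normal model; minimizing it is
    # equivalent to maximizing the likelihood function.
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE estimates: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```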
Monte Carlo method
A Monte Carlo method is a statistical method that relies on repeated random sampling to obtain numerical results. The concept is to use randomness to solve problems that might be deterministic in principle. Monte Carlo methods are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. They are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
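A minimal sketch of Monte Carlo numerical integration in Python with NumPy; the integrand exp(-x^2) on [0, 1] and the sample size are illustrative choices:

```python
# A minimal sketch of Monte Carlo numerical integration: estimate the
# integral of exp(-x^2) over [0, 1] by averaging the integrand at
# uniformly sampled points. The integrand and sample size are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100_000

x = rng.uniform(0.0, 1.0, size=n_samples)        # repeated random sampling
values = np.exp(-x**2)
estimate = values.mean()                          # interval length is 1, so the mean suffices
std_error = values.std(ddof=1) / np.sqrt(n_samples)

print(f"Monte Carlo estimate: {estimate:.5f} +/- {std_error:.5f}")
# The exact value, (sqrt(pi)/2) * erf(1), is approximately 0.74682.
```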
Markov chain Monte Carlo
The Markov chain Monte Carlo method creates samples from a continuous random variable, with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, such as its expected value or variance. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
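A minimal sketch of one common MCMC algorithm, a random-walk Metropolis sampler, written in Python with NumPy; the target density (a standard normal known only up to a constant), the proposal scale, and the chain length are illustrative assumptions:

```python
# A minimal sketch of a random-walk Metropolis sampler, one common MCMC
# algorithm. The target density, proposal scale, and chain length are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)

def unnormalized_density(x):
    # Proportional to a standard normal density; the normalizing
    # constant is never needed by the sampler.
    return np.exp(-0.5 * x**2)

n_steps = 50_000
samples = np.empty(n_steps)
x_current = 0.0

for i in range(n_steps):
    x_proposal = x_current + rng.normal(scale=1.0)             # random-walk proposal
    accept_prob = min(1.0, unnormalized_density(x_proposal)
                           / unnormalized_density(x_current))  # Metropolis acceptance ratio
    if rng.uniform() < accept_prob:
        x_current = x_proposal
    samples[i] = x_current

# The samples approximate expectations under the target distribution,
# e.g. its mean and variance; accuracy improves as more steps are included.
print(f"estimated mean = {samples.mean():.3f}, variance = {samples.var():.3f}")
```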
Applications
* Computational biology
* Computational linguistics
* Computational physics
* Computational mathematics
* Computational materials science
Computational statistics journals
*''Communications in Statistics - Simulation and Computation''
*''Computational Statistics''
*''Computational Statistics & Data Analysis''
*''Journal of Computational and Graphical Statistics''
*''Journal of Statistical Computation and Simulation''
*''Journal of Statistical Software''
*''The R Journal''
*''The Stata Journal''
*''Statistics and Computing''
*''Wiley Interdisciplinary Reviews Computational Statistics''
Associations
* International Association for Statistical Computing
See also
* Algorithms for statistical classification
* Data science
* Statistical methods in artificial intelligence
* Free statistical software
* List of statistical algorithms
* List of statistical packages
* Machine learning
References
Further reading
Books
* Gharieb, Reda R. (2017). ''Data Science: Scientific and Statistical Computing''. Noor Publishing. ISBN 978-3-330-97256-8.
External links
Associations
* International Association for Statistical Computing
* Statistical Computing section of the American Statistical Association
Journals
*''Computational Statistics & Data Analysis''
*''Journal of Computational & Graphical Statistics''
*''Statistics and Computing''