In statistics, groups of individual data points may be classified as belonging to any of various statistical data types, e.g. categorical ("red", "blue", "green"), real number (1.68, -5, 1.7e+6), odd number (1, 3, 5), etc. The data type is a fundamental component of the semantic content of the variable, and controls which sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific: for example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data require, but both fall under the same level of measurement (a ratio scale).

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in degrees Celsius or degrees Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.

Other categorizations have been proposed. For example, Mosteller and Tukey (1977) distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990) described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998), van den Berg (1991).

The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).

Hand, D. J. (2004). Measurement theory and practice: The world through quantification. London, UK: Arnold.
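Hand's remark about truth values and transformations can be made concrete with a small sketch. The Python below is illustrative only: the two groups of scores, the helper `is_greater`, and the two transformations are assumptions invented for this example and do not come from the cited sources. It checks whether the statement "group A scores higher than group B" keeps its truth value under an order-preserving relabelling (permissible for ordinal scales) and under a positive linear rescaling (permissible for interval scales).

```python
from statistics import mean, median

# Hypothetical scores for two groups (made-up values for illustration).
group_a = [1, 1, 1, 1, 10]
group_b = [3, 3, 3, 3, 3]


def is_greater(stat, xs, ys):
    """Truth value of the statement 'stat(xs) > stat(ys)'."""
    return stat(xs) > stat(ys)


def square(x):
    """Order-preserving on non-negative values, but not linear."""
    return x * x


def rescale(x):
    """Positive linear change of units, similar to Celsius to Fahrenheit."""
    return 2 * x + 5


squared_a = [square(x) for x in group_a]
squared_b = [square(x) for x in group_b]
rescaled_a = [rescale(x) for x in group_a]
rescaled_b = [rescale(x) for x in group_b]

# Comparing medians gives the same answer before and after squaring.
median_before = is_greater(median, group_a, group_b)
median_after = is_greater(median, squared_a, squared_b)

# Comparing means flips under squaring (2.8 vs 3 becomes 20.8 vs 9) ...
mean_before = is_greater(mean, group_a, group_b)
mean_after_square = is_greater(mean, squared_a, squared_b)

# ... but is unchanged by the linear rescaling (10.6 vs 11).
mean_after_rescale = is_greater(mean, rescaled_a, rescaled_b)

print(median_before, median_after)                          # False False
print(mean_before, mean_after_square, mean_after_rescale)   # False True False
```

In this sketch the median comparison survives any order-preserving transformation, so it is a sensible statement about ordinal data, whereas the mean comparison is only stable under the linear transformations that an interval scale permits.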

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logical possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.
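As a rough illustration of what "coded using real numbers" can look like in practice, the sketch below maps a binary, a categorical, and an ordinal value to floats. The function names, the example labels, and the choice of indicator (one-hot) coding for categorical values are assumptions made for this example; other coding schemes are equally possible.

```python
def code_binary(value, positive_label):
    """Code a dichotomous value as 1.0 (positive_label) or 0.0 (anything else)."""
    return 1.0 if value == positive_label else 0.0


def code_one_hot(value, categories):
    """Code a categorical value as a vector of 0.0/1.0 indicators."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == category else 0.0 for category in categories]


def code_ordinal(value, ordered_levels):
    """Code an ordinal value by its rank; only the order of the codes is meaningful."""
    return float(ordered_levels.index(value))


colours = ["red", "blue", "green"]      # hypothetical nominal categories
sizes = ["small", "medium", "large"]    # hypothetical ordered levels

print(code_binary("yes", positive_label="yes"))  # 1.0
print(code_one_hot("blue", colours))             # [0.0, 1.0, 0.0]
print(code_ordinal("large", sizes))              # 2.0
```

The numeric codes are a convenience for the underlying theory; the data type still determines which operations on those numbers are meaningful. For instance, averaging the ordinal codes treats the gaps between levels as equal, which the ordinal scale itself does not guarantee.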

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

* Random vectors. The individual elements may or may not be correlated. Examples of distributions used to describe correlated random vectors are the multivariate normal distribution and multivariate t-distribution. In general, there may be arbitrary correlations between any elements and any others; however, this often becomes unmanageable above a certain size, requiring further restrictions on the correlated elements.
* Random matrices. Random matrices can be laid out linearly and treated as random vectors; however, this may not be an efficient way of representing the correlations between different elements. Some probability distributions are specifically designed for random matrices, e.g. the matrix normal distribution and Wishart distribution.
* Random sequences. These are sometimes considered to be the same as random vectors, but in other cases the term is applied specifically to cases where each random variable is only correlated with nearby variables (as in a Markov model). This is a particular case of a Bayes network and often used for very long sequences, e.g. gene sequences or lengthy text documents. A number of models are specifically designed for such sequences, e.g. hidden Markov models.
* Random processes. These are similar to random sequences, but where the length of the sequence is indefinite or infinite and the elements in the sequence are processed one-by-one. This is often used for data that can be described as a time series, e.g. the price of a stock on successive days. Random processes are also used to model values that vary continuously (e.g. the temperature at successive moments in time), rather than at discrete intervals.
* Bayes networks. These correspond to aggregates of random variables described using graphical models, where individual random variables are linked in a graph structure with conditional distributions relating variables to nearby variables.
** Multilevel models are subclasses of Bayes networks that can be thought of as having multiple levels of linear regression.
** Random trees. These are a subclass of Bayes network, where the variables are linked in a tree structure. An example is the problem of parsing a sentence, when statistical parsing techniques are used, such as probabilistic context-free grammars (PCFGs).
* Random fields. These represent the extension of random processes to multiple dimensions, and are common in physics, where they are used in statistical mechanics to describe properties such as force or electric field that can vary continuously over three dimensions (or four dimensions, when time is included).

These concepts originate in various scientific fields and frequently overlap in usage. As a result, it is very often the case that multiple concepts could potentially be applied to the same problem.
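The random-sequence case in the list above, where each variable depends only on its neighbour, can be sketched with a first-order Markov chain. The code below is a minimal illustration using only the Python standard library; the function `sample_markov_chain`, the two weather states, and the transition probabilities are all hypothetical choices made for this example.

```python
import random


def sample_markov_chain(transitions, start, length, rng):
    """Draw a sequence in which each state depends only on the previous one.

    transitions maps each state to a dict {next_state: probability}.
    """
    state = start
    sequence = [state]
    for _ in range(length - 1):
        next_states = list(transitions[state].keys())
        weights = list(transitions[state].values())
        # Sample the next state from the conditional distribution given `state`.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        sequence.append(state)
    return sequence


# Hypothetical two-state model (probabilities are made up for illustration).
weather = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

rng = random.Random(0)  # seeded so repeated runs give the same sequence
sequence = sample_markov_chain(weather, "sunny", 10, rng)
print(sequence)
```

Because only the transition table for the current state is consulted at each step, the cost of describing the sequence does not grow with its length, which is one reason such restricted models are practical for very long sequences.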
