In
statistics, groups of individual
data
In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
points may be classified as belonging to any of various statistical data types, e.g.
categorical ("red", "blue", "green"),
real number
In mathematics, a real number is a number that can be used to measurement, measure a ''continuous'' one-dimensional quantity such as a distance, time, duration or temperature. Here, ''continuous'' means that values can have arbitrarily small var ...
(1.68, -5, 1.7e+6), odd number (1,3,5) etc. The data type is a fundamental component of the semantic content of the variable, and controls which sorts of
probability distribution
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...
s can logically be used to describe the variable, the permissible operations on the variable, the type of
regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
used to predict the variable, etc. The concept of data type is similar to the concept of
level of measurement, but more specific: For example,
count data require a different distribution (e.g. a
Poisson distribution
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known ...
or
binomial distribution
In probability theory and statistics, the binomial distribution with parameters ''n'' and ''p'' is the discrete probability distribution of the number of successes in a sequence of ''n'' independent experiments, each asking a yes–no qu ...
) than non-negative
real-valued
In mathematics, value may refer to several, strongly related notions.
In general, a mathematical value may be any definite mathematical object. In elementary mathematics, this is most often a number – for example, a real number such as or an ...
data require, but both fall under the same level of measurement (a ratio scale).
Various attempts have been made to produce a taxonomy of
levels of measurement. The psychophysicist
Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with
longitude
Longitude (, ) is a geographic coordinate that specifies the east– west position of a point on the surface of the Earth, or another celestial body. It is an angular measurement, usually expressed in degrees and denoted by the Greek let ...
and
temperature
Temperature is a physical quantity that expresses quantitatively the perceptions of hotness and coldness. Temperature is measured with a thermometer.
Thermometers are calibrated in various temperature scales that historically have relied on ...
measurements in degree
Celsius or degree
Fahrenheit
The Fahrenheit scale () is a temperature scale based on one proposed in 1724 by the physicist Daniel Gabriel Fahrenheit (1686–1736). It uses the degree Fahrenheit (symbol: °F) as the unit. Several accounts of how he originally defined h ...
), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.
Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as
categorical variables, whereas ratio and interval measurements are grouped together as
quantitative variables, which can be either
discrete or
continuous, due to their numerical nature. Such distinctions can often be loosely correlated with
data type
In computer science and computer programming, a data type (or simply type) is a set of possible values and a set of allowed operations on it. A data type tells the compiler or interpreter how the programmer intends to use the data. Most progra ...
in computer science, in that dichotomous categorical variables may be represented with the
Boolean data type
In computer science, the Boolean (sometimes shortened to Bool) is a data type that has one of two possible values (usually denoted ''true'' and ''false'') which is intended to represent the two truth values of logic and Boolean algebra. It is named ...
, polytomous categorical variables with arbitrarily assigned
integer
An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign ( −1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the language ...
s in the
integral data type, and continuous variables with the
real data type involving
floating point
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be r ...
computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
Other categorizations have been proposed. For example,
Mosteller and
Tukey (1977) distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990) described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998), van den Berg (1991).
The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).
[Hand, D. J. (2004). ''Measurement theory and practice: The world through quantification.'' London, UK: Arnold.]
Simple data types
The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logical possible values, all of these data types are generally coded using
real number
In mathematics, a real number is a number that can be used to measurement, measure a ''continuous'' one-dimensional quantity such as a distance, time, duration or temperature. Here, ''continuous'' means that values can have arbitrarily small var ...
s, because the theory of
random variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the p ...
s often explicitly assumes that they hold real numbers.
Multivariate data types
Data that cannot be described using a single number are often shoehorned into
random vectors of real-valued
random variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the p ...
s, although there is an increasing tendency to treat them on their own. Some examples:
*
Random vectors. The individual elements may or may not be
correlated
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statisti ...
. Examples of distributions used to describe correlated random vectors are the
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One ...
and
multivariate t-distribution. In general, there may be arbitrary correlations between any elements and any others; however, this often becomes unmanageable above a certain size, requiring further restrictions on the correlated elements.
*
Random matrices. Random matrices can be laid out linearly and treated as random vectors; however, this may not be an efficient way of representing the correlations between different elements. Some probability distributions are specifically designed for random matrices, e.g. the
matrix normal distribution and
Wishart distribution.
*
Random sequences. These are sometimes considered to be the same as random vectors, but in other cases the term is applied specifically to cases where each random variable is only correlated with nearby variables (as in a
Markov model
In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Mark ...
). This is a particular case of a
Bayes network and often used for very long sequences, e.g. gene sequences or lengthy text documents. A number of models are specifically designed for such sequences, e.g.
hidden Markov models.
*
Random processes. These are similar to random sequences, but where the length of the sequence is indefinite or infinite and the elements in the sequence are processed one-by-one. This is often used for data that can be described as a
time series
In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. E ...
, e.g. the price of a stock on successive days. Random processes are also used to model values that vary continuously (e.g. the temperature at successive moments in time), rather than at discrete intervals.
*
Bayes networks. These correspond to aggregates of random variables described using
graphical model
A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability ...
s, where individual random variables are linked in a
graph
Graph may refer to:
Mathematics
*Graph (discrete mathematics), a structure made of vertices and edges
**Graph theory, the study of such graphs and their properties
*Graph (topology), a topological space resembling a graph in the sense of discre ...
structure with
conditional distributions relating variables to nearby variables.
**
Multilevel model
Multilevel models (also known as hierarchical linear models, linear mixed-effect model, mixed models, nested data models, random coefficient, random-effects models, random parameter models, or split-plot designs) are statistical models of parame ...
s are subclasses of Bayes networks that can be thought of as having multiple levels of
linear regression
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is ...
.
**
Random trees. These are a subclass of Bayes network, where the variables are linked in a
tree structure
A tree structure, tree diagram, or tree model is a way of representing the hierarchical nature of a structure in a graphical form. It is named a "tree structure" because the classic representation resembles a tree, although the chart is genera ...
. An example is the problem of
parsing a sentence, when statistical parsing techniques are used, such as
probabilistic context-free grammars (PCFG's).
*
Random fields. These represent the extension of
random processes to multiple dimensions, and are common in
physics
Physics is the natural science that studies matter, its fundamental constituents, its motion and behavior through space and time, and the related entities of energy and force. "Physical science is that department of knowledge which rel ...
, where they are used in
statistical mechanics to describe properties such as
force
In physics, a force is an influence that can change the motion of an object. A force can cause an object with mass to change its velocity (e.g. moving from a state of rest), i.e., to accelerate. Force can also be described intuitively as a ...
or
electric field that can vary continuously over three dimensions (or four dimensions, when time is included).
These concepts originate in various scientific fields and frequently overlap in usage. As a result, it is very often the case that multiple concepts could potentially be applied to the same problem.
References
{{Reflist