In statistics, data can have any of various ''types''. Statistical data types include categorical (e.g. country), directional (angles or directions, e.g. wind measurements), count (a whole number of events), and real intervals (e.g. measures of temperature). The data type is a fundamental concept in statistics and controls what sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific. For example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data require, but both fall under the same level of measurement (a ratio scale).

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as is the case with longitude and temperature measurements in degrees Celsius or degrees Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.

Other categorizations have been proposed. For example, Mosteller and Tukey (1977) distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990) described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998), van den Berg (1991).

The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).
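The point about truth values that are not invariant under transformation can be made concrete with a small numerical sketch. The Python example below uses made-up scores for two groups and asks whether "group A scores higher than group B" stays true under a positive linear map (permissible on an interval scale) and under a logarithm (order-preserving only, so permissible on an ordinal scale). The data, the helper name `a_exceeds_b`, and the choice of transformations are illustrative assumptions, not taken from the cited sources.

```python
import math
from statistics import mean, median

# Illustrative scores for two groups (made-up numbers).
group_a = [1, 2, 10]
group_b = [3, 4, 5]

def a_exceeds_b(stat, transform=lambda x: x):
    """Is stat(A) > stat(B) after applying `transform` to every value?"""
    a = [transform(x) for x in group_a]
    b = [transform(x) for x in group_b]
    return stat(a) > stat(b)

def linear(x):
    # Positive linear map: permissible on an interval scale.
    return 1.8 * x + 32

# The comparison of means survives the linear map but flips under the log.
print(a_exceeds_b(mean))            # True
print(a_exceeds_b(mean, linear))    # True
print(a_exceeds_b(mean, math.log))  # False

# The comparison of medians is unchanged by the order-preserving log.
print(a_exceeds_b(median), a_exceeds_b(median, math.log))  # False False
```

In this sketch a statement about means is meaningful for interval-level data but not for merely ordinal data, whereas the same statement about medians holds up under any order-preserving transformation.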

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logical possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.
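As a minimal sketch of what coding with real numbers can look like, the Python example below represents a categorical variable with K categories as a length-K indicator vector of floats. This indicator coding is one common convention and is assumed here for illustration; the function name `one_hot` and the sample values are hypothetical.

```python
def one_hot(values):
    """Code categorical values as indicator vectors of real numbers."""
    # Fix an arbitrary but reproducible ordering of the categories.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0  # exactly one entry is 1.0 per observation
        rows.append(row)
    return categories, rows

categories, rows = one_hot(["red", "blue", "red"])
print(categories)  # ['blue', 'red']
print(rows)        # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

The numbers carry no ordering or magnitude of their own. They only record which category was observed, so arithmetic on them is meaningful only in ways that respect that coding, such as averaging a column to get a category's relative frequency.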

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

* Random vectors. The individual elements may or may not be correlated. Examples of distributions used to describe correlated random vectors are the multivariate normal distribution and multivariate t-distribution. In general, there may be arbitrary correlations between any elements and any others; however, this often becomes unmanageable above a certain size, requiring further restrictions on the correlated elements.
* Random matrices. Random matrices can be laid out linearly and treated as random vectors; however, this may not be an efficient way of representing the correlations between different elements. Some probability distributions are specifically designed for random matrices, e.g. the matrix normal distribution and Wishart distribution.
* Random sequences. These are sometimes considered to be the same as random vectors, but in other cases the term is applied specifically to cases where each random variable is only correlated with nearby variables (as in a Markov model). This is a particular case of a Bayes network and is often used for very long sequences, e.g. gene sequences or lengthy text documents. A number of models are specifically designed for such sequences, e.g. hidden Markov models.
* Random processes. These are similar to random sequences, but the length of the sequence is indefinite or infinite and the elements in the sequence are processed one by one. This is often used for data that can be described as a time series, e.g. the price of a stock on successive days. Random processes are also used to model values that vary continuously (e.g. the temperature at successive moments in time), rather than at discrete intervals.
* Bayes networks. These correspond to aggregates of random variables described using graphical models, where individual random variables are linked in a graph structure with conditional distributions relating variables to nearby variables.
** Multilevel models are subclasses of Bayes networks that can be thought of as having multiple levels of linear regression.
** Random trees. These are a subclass of Bayes network, where the variables are linked in a tree structure. An example is the problem of parsing a sentence, when statistical parsing techniques are used, such as probabilistic context-free grammars (PCFGs).
* Random fields. These represent the extension of random processes to multiple dimensions, and are common in physics, where they are used in statistical mechanics to describe properties such as force or electric field that can vary continuously over three dimensions (or four dimensions, when time is included).

These concepts originate in various scientific fields and frequently overlap in usage. As a result, multiple concepts can often be applied to the same problem.
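To illustrate the random-sequence case, where each variable depends only on its neighbour, the following Python sketch samples from a two-state Markov chain. The states, the transition probabilities, and the function name `sample_sequence` are all hypothetical choices made for the example.

```python
import random

# Hypothetical two-state weather model; the probabilities are illustrative only.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_sequence(start, length, rng):
    """Draw a sequence in which each element depends only on its predecessor."""
    sequence = [start]
    while len(sequence) < length:
        options = TRANSITIONS[sequence[-1]]  # conditional distribution given the last state
        states = list(options)
        weights = [options[state] for state in states]
        sequence.append(rng.choices(states, weights=weights, k=1)[0])
    return sequence

rng = random.Random(0)  # seeded so that repeated runs give the same draw
print(sample_sequence("sunny", 10, rng))
```

Only one small transition table is needed however long the sequence grows, which is the kind of restriction on correlations that keeps very long sequences manageable.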

Comparison to programming data types

Most data types in statistics have comparable types in computer programming, and vice versa, as shown in the following table:
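As a rough sketch of this correspondence in one particular language, the Python example below gives a single observation one field per statistical type: a Boolean for a dichotomous variable, an enumeration with arbitrarily assigned integers for a polytomous one, an integer for a count, and a float for a real-valued measurement. The class and field names are hypothetical, and other languages or categorizations would map differently.

```python
from dataclasses import dataclass
from enum import Enum

class Country(Enum):
    # Polytomous categorical: the integer codes are arbitrary labels.
    FRANCE = 1
    JAPAN = 2
    BRAZIL = 3

@dataclass
class Observation:
    is_smoker: bool        # dichotomous categorical -> Boolean
    country: Country       # polytomous categorical -> arbitrarily assigned integers
    visits: int            # count -> integer
    temperature_c: float   # real-valued -> floating point

obs = Observation(is_smoker=True, country=Country.JAPAN, visits=3, temperature_c=36.6)

# Arithmetic is meaningful on the count and the real value ...
print(obs.visits + 1, obs.temperature_c - 0.5)
# ... whereas the category code only identifies a label.
print(Country(2).name)
```

Wrapping the category codes in an `Enum` rather than using bare integers is a design choice that keeps code from accidentally adding or averaging labels.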
