In
machine learning
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
, a probabilistic classifier is a
classifier that is able to predict, given an observation of an input, a
probability distribution
In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
over a
set
Set, The Set, SET or SETS may refer to:
Science, technology, and mathematics Mathematics
*Set (mathematics), a collection of elements
*Category of sets, the category whose objects and morphisms are sets and total functions, respectively
Electro ...
of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into
ensembles.
Types of classification
Formally, an "ordinary" classifier is some rule, or
function, that assigns to a sample a class label :
:
The samples come from some set (e.g., the set of all
documents
A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ', which denotes a "teaching" or "lesson": ...
, or the set of all
images
An image or picture is a visual representation. An image can be two-dimensional, such as a drawing, painting, or photograph, or three-dimensional, such as a carving or sculpture. Images may be displayed through other media, including a project ...
), while the class labels form a finite set defined prior to training.
Probabilistic classifiers generalize this notion of classifiers: instead of functions, they are
conditional distributions
, meaning that for a given
, they assign probabilities to all
(and these probabilities sum to one). "Hard" classification can then be done using the
optimal decision rule
:
or, in English, the predicted class is that which has the highest probability.
Binary probabilistic classifiers are also called
binary regression models in
statistics
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
. In
econometrics
Econometrics is an application of statistical methods to economic data in order to give empirical content to economic relationships. M. Hashem Pesaran (1987). "Econometrics", '' The New Palgrave: A Dictionary of Economics'', v. 2, p. 8 p. 8 ...
, probabilistic classification in general is called
discrete choice
In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such c ...
.
Some classification models, such as
naive Bayes
In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of " probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes th ...
,
logistic regression
In statistics, a logistic model (or logit model) is a statistical model that models the logit, log-odds of an event as a linear function (calculus), linear combination of one or more independent variables. In regression analysis, logistic regres ...
and
multilayer perceptron
In deep learning, a multilayer perceptron (MLP) is a name for a modern feedforward neural network consisting of fully connected neurons with nonlinear activation functions, organized in layers, notable for being able to distinguish data that is ...
s (when trained under an appropriate
loss function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
) are naturally probabilistic. Other models such as
support vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborato ...
s are not, but
methods exist to turn them into probabilistic classifiers.
Generative and conditional training
Some models, such as
logistic regression
In statistics, a logistic model (or logit model) is a statistical model that models the logit, log-odds of an event as a linear function (calculus), linear combination of one or more independent variables. In regression analysis, logistic regres ...
, are conditionally trained: they optimize the conditional probability
directly on a training set (see
empirical risk minimization
In statistical learning theory, the principle of empirical risk minimization defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large num ...
). Other classifiers, such as
naive Bayes
In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of " probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes th ...
, are trained
generatively: at training time, the class-conditional distribution
and the class
prior
The term prior may refer to:
* Prior (ecclesiastical), the head of a priory (monastery)
* Prior convictions, the life history and previous convictions of a suspect or defendant in a criminal case
* Prior probability, in Bayesian statistics
* Prio ...
are found, and the conditional distribution
is derived using
Bayes' rule.
Probability calibration
Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers,
decision trees
A decision tree is a decision support system, decision support recursive partitioning structure that uses a Tree (graph theory), tree-like Causal model, model of decisions and their possible consequences, including probability, chance event ou ...
and
boosting methods, produce distorted class probability distributions.
In the case of decision trees, where is the proportion of training samples with label in the leaf where ends up, these distortions come about because learning algorithms such as
C4.5 or
CART
A cart or dray (Australia and New Zealand) is a vehicle designed for transport, using two wheels and normally pulled by draught animals such as horses, donkeys, mules and oxen, or even smaller animals such as goats or large dogs.
A handcart ...
explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high
bias
Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is inaccurate, closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individ ...
) while using few samples to estimate the relevant proportion (high
variance
In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion ...
).

Calibration can be assessed using a calibration plot (also called a reliability diagram).
A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly-calibrated classifier for which the predicted probabilities or scores can not be used as probabilities. In this case one can use a method to turn these scores into properly
calibrated
In measurement technology and metrology, calibration is the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy. Such a standard could be another measurement device of known ...
class membership probabilities.
For the
binary
Binary may refer to:
Science and technology Mathematics
* Binary number, a representation of numbers using only two values (0 and 1) for each digit
* Binary function, a function that takes two arguments
* Binary operation, a mathematical op ...
case, a common approach is to apply
Platt scaling
In machine learning, Platt scaling or Platt calibration is a way of transforming the outputs of a statistical classification, classification model into a probabilistic classification, probability distribution over classes. The method was invented ...
, which learns a
logistic regression
In statistics, a logistic model (or logit model) is a statistical model that models the logit, log-odds of an event as a linear function (calculus), linear combination of one or more independent variables. In regression analysis, logistic regres ...
model on the scores.
An alternative method using
isotonic regression
In statistics and numerical analysis, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing function, non-decreasing (or non-increasing) ...
is generally superior to Platt's method when sufficient training data is available.
In the
multiclass case, one can use a reduction to binary tasks, followed by univariate calibration with an algorithm as described above and further application of the pairwise coupling algorithm by Hastie and Tibshirani.
Evaluating probabilistic classification
Commonly used evaluation metrics that compare the predicted probability to observed outcomes include
log loss,
Brier score
The Brier score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities.
The Brier score ...
, and a variety of calibration errors. The former is also used as a loss function in the training of logistic models.
Calibration errors metrics aim to quantify the extent to which a probabilistic classifier's outputs are ''well-calibrated''. As
Philip Dawid put it, "a forecaster is well-calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent". Foundational work in the domain of measuring calibration error is the Expected Calibration Error (ECE) metric. More recent works propose variants to ECE that address limitations of the ECE metric that may arise when classifier scores concentrate on narrow subset of the
,1 including the Adaptive Calibration Error (ACE) and Test-based Calibration Error (TCE).
A method used to assign scores to pairs of predicted probabilities and actual discrete outcomes, so that different predictive methods can be compared, is called a
scoring rule
In decision theory, a scoring rule
provides evaluation metrics for probabilistic forecasting, probabilistic predictions or forecasts. While "regular" loss functions (such as mean squared error) assign a goodness-of-fit score to a predicted value ...
.
Software Implementations
* MoRPE
is a trainable probabilistic classifier that uses
isotonic regression
In statistics and numerical analysis, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing function, non-decreasing (or non-increasing) ...
for probability calibration. It solves the
multiclass case by reduction to binary tasks. It is a type of kernel machine that uses an inhomogeneous polynomial kernel.
References
{{reflist, 30em
Probabilistic models
Statistical classification