Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM), which may possess PR capabilities but whose primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.
Pattern recognition systems are commonly trained from labeled "training" data. When no labeled data are available, other algorithms can be used to discover previously unknown patterns. Knowledge discovery in databases (KDD) and data mining have a larger focus on unsupervised methods and a stronger connection to business use. Pattern recognition focuses more on the signal and also takes acquisition and signal processing into consideration. It originated in engineering, and the term is popular in the context of computer vision: a leading computer vision conference is named Conference on Computer Vision and Pattern Recognition.
In machine learning, pattern recognition is the assignment of a label to a given input value. In statistics, discriminant analysis was introduced for this same purpose in 1936. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of ''classes'' (for example, determine whether a given email is "spam"). Pattern recognition is a more general problem that encompasses other types of output as well. Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part-of-speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to perform "most likely" matching of the inputs, taking into account their statistical variation. This is opposed to ''pattern matching'' algorithms, which look for exact matches in the input with pre-existing patterns. A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors.
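To make the contrast concrete, here is a minimal Python sketch of exact pattern matching with a regular expression; the date pattern and test strings are invented for the example:

```python
import re

# Pattern matching: an exact, rule-based match with no notion of probability.
# This regex matches simple ISO-style dates such as "2024-01-31".
date_pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

texts = ["released on 2024-01-31", "released on Jan 31, 2024"]
for t in texts:
    match = date_pattern.search(t)
    # The first string matches exactly; the second does not, even though a
    # statistical pattern recognizer could judge both to contain a date.
    print(t, "->", match.group(0) if match else "no match")
```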
Overview
A modern definition of pattern recognition is:
"The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories." (Bishop, 2006)
Pattern recognition is generally categorized according to the type of learning procedure used to generate the output value. ''Supervised learning'' assumes that a set of training data (the training set) has been provided, consisting of a set of instances that have been properly labeled by hand with the correct output. A learning procedure then generates a model that attempts to meet two sometimes conflicting objectives: perform as well as possible on the training data, and generalize as well as possible to new data (usually, this means being as simple as possible, for some technical definition of "simple", in accordance with Occam's razor, discussed below). ''Unsupervised learning'', on the other hand, assumes training data that has not been hand-labeled, and attempts to find inherent patterns in the data that can then be used to determine the correct output value for new data instances. A combination of the two that has been explored is semi-supervised learning, which uses a combination of labeled and unlabeled data (typically a small set of labeled data combined with a large amount of unlabeled data). In cases of unsupervised learning, there may be no training data at all.
Sometimes different terms are used to describe the corresponding supervised and unsupervised learning procedures for the same type of output. The unsupervised equivalent of classification is normally known as ''clustering'', based on the common perception of the task as involving no training data to speak of, and of grouping the input data into clusters based on some inherent similarity measure (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space), rather than assigning each input instance into one of a set of pre-defined classes. In some fields, the terminology is different. In community ecology, the term ''classification'' is used to refer to what is commonly known as "clustering".
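To illustrate, below is a bare-bones k-means clustering sketch in NumPy that groups instances purely by Euclidean distance, with no labels involved; the synthetic two-blob data and the choice of distance measure are assumptions for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: group points by Euclidean distance, no labels needed."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each instance to the nearest center (the similarity measure).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned instances
        # (keeping the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two synthetic blobs; the algorithm recovers the grouping without any labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```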
The piece of input data for which an output value is generated is formally termed an ''instance''. The instance is formally described by a vector of features, which together constitute a description of all known characteristics of the instance. These feature vectors can be seen as defining points in an appropriate multidimensional space, and methods for manipulating vectors in vector spaces can be correspondingly applied to them, such as computing the dot product or the angle between two vectors. Features typically are either categorical (also known as nominal, i.e., consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting of one of a set of ordered items, e.g., "large", "medium" or "small"), integer-valued (e.g., a count of the number of occurrences of a particular word in an email) or real-valued (e.g., a measurement of blood pressure). Often, categorical and ordinal data are grouped together, and this is also the case for integer-valued and real-valued data. Many algorithms work only in terms of categorical data and require that real-valued or integer-valued data be ''discretized'' into groups (e.g., less than 5, between 5 and 10, or greater than 10).
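As an illustration of such discretization, here is a minimal NumPy sketch that bins an integer-valued feature (word counts, as in the example above) into the three groups just mentioned; the specific counts are invented:

```python
import numpy as np

counts = np.array([0, 3, 7, 12])  # e.g., occurrences of a word in emails

# Discretize into three groups matching the example in the text:
# less than 5, between 5 and 10, greater than 10.
bins = np.array([5, 10])
groups = np.digitize(counts, bins)  # index of the bin each value falls into

labels = np.array(["less than 5", "between 5 and 10", "greater than 10"])
print(labels[groups])  # ['less than 5' 'less than 5' 'between 5 and 10' 'greater than 10']
```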
Probabilistic classifiers
Many common pattern recognition algorithms are ''probabilistic'' in nature, in that they use statistical inference to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, probabilistic algorithms often also output a probability of the instance being described by the given label. In addition, many probabilistic algorithms output a list of the ''N''-best labels with associated probabilities, for some value of ''N'', instead of simply a single best label. When the number of possible labels is fairly small (e.g., in the case of classification), ''N'' may be set so that the probability of all possible labels is output. Probabilistic algorithms have many advantages over non-probabilistic algorithms:
*They output a confidence value associated with their choice. (Note that some other algorithms may also output confidence values, but in general, only for probabilistic algorithms is this value mathematically grounded in probability theory. Non-probabilistic confidence values can in general not be given any specific meaning, and can only be used to compare against other confidence values output by the same algorithm.)
*Correspondingly, they can ''abstain'' when the confidence of choosing any particular output is too low.
*Because of the probabilities output, probabilistic pattern-recognition algorithms can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of ''error propagation''.
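A minimal sketch of these advantages, assuming some classifier has already produced a posterior distribution over labels (the probabilities below are invented for illustration):

```python
import numpy as np

def predict_with_abstention(probs, labels, threshold=0.7, n_best=2):
    """Given per-label posterior probabilities, return the N-best labels,
    a probability-grounded confidence, and an abstention decision."""
    order = np.argsort(probs)[::-1]          # labels sorted by probability
    n_best_labels = [(labels[i], float(probs[i])) for i in order[:n_best]]
    confidence = float(probs[order[0]])
    # Abstain when even the most probable label is not confident enough.
    decision = labels[order[0]] if confidence >= threshold else "abstain"
    return decision, confidence, n_best_labels

labels = ["spam", "non-spam"]
print(predict_with_abstention(np.array([0.95, 0.05]), labels))  # confident
print(predict_with_abstention(np.array([0.55, 0.45]), labels))  # abstains
```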
Number of important feature variables
Feature selection algorithms attempt to directly prune out redundant or irrelevant features. A general introduction to feature selection, which summarizes approaches and challenges, has been given. Because of its non-monotonous character, feature selection is an optimization problem where, given a total of $n$ features, the powerset consisting of all $2^n$ subsets of features needs to be explored. The branch-and-bound algorithm reduces this complexity but is intractable for medium to large numbers of available features.
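To see concretely why the $2^n$ powerset makes exhaustive feature selection intractable, the following sketch enumerates every subset and scores it with a placeholder criterion; in practice the score would be something like cross-validated accuracy, and the criterion used here is purely hypothetical:

```python
from itertools import combinations

def exhaustive_feature_selection(n_features, score):
    """Enumerate all 2^n subsets of feature indices and keep the best one.
    Feasible only for small n: n = 20 already gives about a million subsets."""
    best_subset, best_score = (), float("-inf")
    features = range(n_features)
    for size in range(n_features + 1):
        for subset in combinations(features, size):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Hypothetical criterion: reward features 1 and 3, penalize subset size.
score = lambda s: sum(1 for f in s if f in (1, 3)) - 0.1 * len(s)
print(exhaustive_feature_selection(5, score))  # -> ((1, 3), 1.8)
```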
Techniques to transform the raw feature vectors (feature extraction) are sometimes used prior to application of the pattern-matching algorithm. Feature extraction algorithms attempt to reduce a large-dimensionality feature vector into a smaller-dimensionality vector that is easier to work with and encodes less redundancy, using mathematical techniques such as principal components analysis (PCA). The distinction between feature selection and feature extraction is that the resulting features after feature extraction has taken place are of a different sort than the original features and may not easily be interpretable, while the features left after feature selection are simply a subset of the original features.
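A minimal NumPy sketch of PCA-style feature extraction, projecting assumed 5-dimensional feature vectors down to 2 derived features; note that the extracted features are linear combinations of the originals and thus generally not interpretable as the original features:

```python
import numpy as np

def pca_extract(X, n_components):
    """Project feature vectors onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 instances, 5 original features
Z = pca_extract(X, n_components=2)   # 100 instances, 2 extracted features
print(Z.shape)  # (100, 2)
```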
Problem statement
The problem of pattern recognition can be stated as follows: Given an unknown function $g:\mathcal{X}\rightarrow\mathcal{Y}$ (the ''ground truth'') that maps input instances $\mathbf{x} \in \mathcal{X}$ to output labels $y \in \mathcal{Y}$, along with training data $\mathbf{D} = \{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$ assumed to represent accurate examples of the mapping, produce a function $h:\mathcal{X}\rightarrow\mathcal{Y}$ that approximates as closely as possible the correct mapping $g$. (For example, if the problem is filtering spam, then $\mathbf{x}_i$ is some representation of an email and $y$ is either "spam" or "non-spam"). In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously. In decision theory, this is defined by specifying a loss function or cost function that assigns a specific value to "loss" resulting from producing an incorrect label. The goal then is to minimize the expected loss, with the expectation taken over the probability distribution of $\mathcal{X}$. In practice, neither the distribution of $\mathcal{X}$ nor the ground truth function $g:\mathcal{X}\rightarrow\mathcal{Y}$ are known exactly, but can be computed only empirically by collecting a large number of samples of $\mathcal{X}$ and hand-labeling them using the correct value of $\mathcal{Y}$ (a time-consuming process, which is typically the limiting factor in the amount of data of this sort that can be collected). The particular loss function depends on the type of label being predicted. For example, in the case of classification, the simple zero-one loss function is often sufficient. This corresponds simply to assigning a loss of 1 to any incorrect labeling and implies that the optimal classifier minimizes the error rate on independent test data (i.e. counting up the fraction of instances that the learned function $h:\mathcal{X}\rightarrow\mathcal{Y}$ labels wrongly, which is equivalent to maximizing the number of correctly classified instances). The goal of the learning procedure is then to minimize the error rate (maximize the correctness) on a "typical" test set.
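For the zero-one loss, the quantity being minimized is simply the fraction of wrongly labeled test instances; a short sketch, where the classifier $h$ and the test data are placeholders invented for the example:

```python
import numpy as np

def zero_one_error_rate(h, X_test, y_test):
    """Fraction of test instances that h labels wrongly (empirical 0-1 risk)."""
    predictions = np.array([h(x) for x in X_test])
    return float(np.mean(predictions != y_test))

# Hypothetical classifier: threshold the first (and only) feature.
h = lambda x: "spam" if x[0] > 0.5 else "non-spam"
X_test = np.array([[0.9], [0.2], [0.7]])
y_test = np.array(["spam", "non-spam", "non-spam"])
print(zero_one_error_rate(h, X_test, y_test))  # 1 of 3 wrong -> ~0.333
```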
For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.e., to estimate a function of the form

:$p({\rm label}|\mathbf{x},\boldsymbol\theta)$

where the feature vector input is $\mathbf{x}$, and the function ''f'' is typically parameterized by some parameters $\boldsymbol\theta$. In a discriminative approach to the problem, ''f'' is estimated directly. In a generative approach, however, the inverse probability $p(\mathbf{x}|{\rm label})$ is instead estimated and combined with the prior probability $p({\rm label}|\boldsymbol\theta)$ using Bayes' rule, as follows:

:$p({\rm label}|\mathbf{x},\boldsymbol\theta) = \frac{p(\mathbf{x}|{\rm label},\boldsymbol\theta)\, p({\rm label}|\boldsymbol\theta)}{\sum_{L \in \text{all labels}} p(\mathbf{x}|L,\boldsymbol\theta)\, p(L|\boldsymbol\theta)}$
When the labels are continuously distributed (e.g., in regression analysis), the denominator involves integration rather than summation:

:$p({\rm label}|\mathbf{x},\boldsymbol\theta) = \frac{p(\mathbf{x}|{\rm label},\boldsymbol\theta)\, p({\rm label}|\boldsymbol\theta)}{\int_{L} p(\mathbf{x}|L,\boldsymbol\theta)\, p(L|\boldsymbol\theta)\, dL}$
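A minimal sketch of the generative approach in the discrete case above: class-conditional likelihoods $p(\mathbf{x}|{\rm label})$ (here assumed Gaussian over a single feature) and priors are estimated from synthetic training data, then inverted with Bayes' rule; all numbers are invented for illustration:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Generative approach: estimate p(x | label) and p(label) from training data.
rng = np.random.default_rng(0)
train = {"spam": rng.normal(5.0, 1.0, 200),      # e.g., a 1-D feature such as
         "non-spam": rng.normal(2.0, 1.0, 800)}  # a count of suspicious words

params = {lbl: (xs.mean(), xs.std()) for lbl, xs in train.items()}
priors = {lbl: len(xs) / 1000 for lbl, xs in train.items()}

def posterior(x):
    # Numerator of Bayes' rule for each label, then normalize (denominator).
    joint = {lbl: gaussian_pdf(x, *params[lbl]) * priors[lbl] for lbl in params}
    z = sum(joint.values())
    return {lbl: p / z for lbl, p in joint.items()}

print(posterior(4.5))  # p(label | x) for a new instance
```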
The value of $\boldsymbol\theta$ is typically learned using maximum a posteriori (MAP) estimation. This finds the best value that simultaneously meets two conflicting objectives: to perform as well as possible on the training data (smallest error-rate) and to find the simplest possible model. Essentially, this combines maximum likelihood estimation with a regularization procedure that favors simpler models over more complex models. In a Bayesian context, the regularization procedure can be viewed as placing a prior probability $p(\boldsymbol\theta)$ on different values of $\boldsymbol\theta$. Mathematically:

:$\boldsymbol\theta^* = \arg\max_{\boldsymbol\theta}\; p(\boldsymbol\theta|\mathbf{D})$
where $\boldsymbol\theta^*$ is the value used for $\boldsymbol\theta$ in the subsequent evaluation procedure, and $p(\boldsymbol\theta|\mathbf{D})$, the posterior probability of $\boldsymbol\theta$, is given by

:$p(\boldsymbol\theta|\mathbf{D}) \propto \left[\prod_{i=1}^n p(y_i|\mathbf{x}_i,\boldsymbol\theta)\right] p(\boldsymbol\theta)$
In the Bayesian approach to this problem, instead of choosing a single parameter vector $\boldsymbol\theta^*$, the probability of a given label for a new instance $\mathbf{x}$ is computed by integrating over all possible values of $\boldsymbol\theta$, weighted according to the posterior probability:

:$p({\rm label}|\mathbf{x}) = \int p({\rm label}|\mathbf{x},\boldsymbol\theta)\, p(\boldsymbol\theta|\mathbf{D})\, d\boldsymbol\theta$
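As a concrete instance of the MAP idea above, placing a Gaussian prior on the parameter turns maximum likelihood into a shrinkage (regularized) estimate; a sketch for estimating the mean of Gaussian data, where the prior width is an assumed hyperparameter:

```python
import numpy as np

# Model: x_i ~ N(theta, sigma^2); prior: theta ~ N(0, tau^2).
# The MAP estimate maximizes log p(D|theta) + log p(theta), which here has
# a closed form: the sample mean shrunk toward the prior mean 0.
sigma, tau = 1.0, 0.5  # assumed known noise scale and prior width
rng = np.random.default_rng(0)
data = rng.normal(2.0, sigma, size=10)

theta_ml = data.mean()  # maximum likelihood: fits the training data only
theta_map = data.sum() / (len(data) + sigma**2 / tau**2)  # regularized

print(f"ML estimate:  {theta_ml:.3f}")
print(f"MAP estimate: {theta_map:.3f}  (pulled toward the simpler prior)")
```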
Frequentist or Bayesian approach to pattern recognition
The first pattern classifier – the linear discriminant presented by Fisher – was developed in the frequentist tradition. The frequentist approach entails that the model parameters are considered unknown, but objective. The parameters are then computed (estimated) from the collected data. For the linear discriminant, these parameters are precisely the mean vectors and the covariance matrix. Also the probability of each class $p({\rm label}|\boldsymbol\theta)$ is estimated from the collected dataset. Note that the usage of 'Bayes rule' in a pattern classifier does not make the classification approach Bayesian.
Bayesian statistics has its origin in Greek philosophy, where a distinction was already made between 'a priori' and 'a posteriori' knowledge. Later Kant defined his distinction between what is a priori known – before observation – and the empirical knowledge gained from observations. In a Bayesian pattern classifier, the class probabilities $p({\rm label}|\boldsymbol\theta)$ can be chosen by the user, which are then a priori. Moreover, experience quantified as a priori parameter values can be weighted with empirical observations – using e.g. the Beta (conjugate prior) and Dirichlet distributions. The Bayesian approach facilitates a seamless intermixing between expert knowledge in the form of subjective probabilities and objective observations.
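The Beta (conjugate prior) case can be sketched in a few lines: subjective prior knowledge enters as pseudo-counts and mixes seamlessly with observed data, with the posterior staying in the same family; the prior values below are an assumed subjective choice:

```python
# Conjugate Beta prior for a Bernoulli class probability.
# Subjective prior knowledge, expressed as pseudo-counts:
alpha, beta = 2.0, 8.0          # a priori belief: class occurs ~20% of the time

# Empirical observations: 30 positives out of 100 samples.
positives, negatives = 30, 70

# The posterior is again a Beta distribution: prior and data mix seamlessly.
alpha_post, beta_post = alpha + positives, beta + negatives
posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"posterior mean class probability: {posterior_mean:.3f}")  # ~0.291
```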
Probabilistic pattern classifiers can be used according to a frequentist or a Bayesian approach.
Uses

Within medical science, pattern recognition is the basis for computer-aided diagnosis (CAD) systems. CAD describes a procedure that supports the doctor's interpretations and findings. Other typical applications of pattern recognition techniques are automatic speech recognition, speaker identification, classification of text into several categories (e.g., spam or non-spam email messages), the automatic recognition of handwriting on postal envelopes, automatic recognition of images of human faces, or handwriting image extraction from medical forms. The last two examples form the subtopic image analysis of pattern recognition that deals with digital images as input to pattern recognition systems.
Optical character recognition is an example of the application of a pattern classifier. The method of signing one's name was captured with stylus and overlay starting in 1990. The strokes, speed, relative minima, relative maxima, acceleration and pressure are used to uniquely identify and confirm identity. Banks were first offered this technology, but were content to collect from the FDIC for any bank fraud and did not want to inconvenience customers.
Pattern recognition has many real-world applications in image processing. Some examples include:
* identification and authentication: e.g., license plate recognition, fingerprint analysis, face detection/verification, and voice-based authentication;
* medical diagnosis: e.g., screening for cervical cancer (Papnet), breast tumors or heart sounds;
* defense: various navigation and guidance systems, target recognition systems, shape recognition technology, etc.;
* mobility: advanced driver assistance systems, autonomous vehicle technology, etc.
In psychology, pattern recognition is used to make sense of and identify objects, and is closely related to perception. This explains how the sensory inputs humans receive are made meaningful. Pattern recognition can be thought of in two different ways. The first concerns template matching and the second concerns feature detection. A template is a pattern used to produce items of the same proportions. The template-matching hypothesis suggests that incoming stimuli are compared with templates in long-term memory. If there is a match, the stimulus is identified. Feature detection models, such as the Pandemonium system for classifying letters (Selfridge, 1959), suggest that stimuli are broken down into their component parts for identification. For example, a capital E can be decomposed into three horizontal lines and one vertical line.
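The template-matching hypothesis can be illustrated with a toy sketch that compares a binary stimulus against stored templates and identifies it by the best overlap; the 3x3 'letter' grids are invented for the example:

```python
import numpy as np

# Stored templates in "long-term memory": crude 3x3 binary letter shapes.
templates = {
    "T": np.array([[1, 1, 1],
                   [0, 1, 0],
                   [0, 1, 0]]),
    "L": np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 1]]),
}

def recognize(stimulus):
    # A stimulus is identified by the template it overlaps with most.
    scores = {name: int((stimulus == t).sum()) for name, t in templates.items()}
    return max(scores, key=scores.get), scores

stimulus = np.array([[1, 1, 1],
                     [0, 1, 0],
                     [0, 1, 0]])
print(recognize(stimulus))  # ('T', {'T': 9, 'L': 3})
```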
Algorithms
Algorithms for pattern recognition depend on the type of label output, on whether learning is supervised or unsupervised, and on whether the algorithm is statistical or non-statistical in nature. Statistical algorithms can further be categorized as generative or discriminative.
Classification methods (methods predicting categorical labels)
Parametric:
* Linear discriminant analysis
* Quadratic discriminant analysis
* Maximum entropy classifier (aka logistic regression, multinomial logistic regression): Note that logistic regression is an algorithm for classification, despite its name. (The name comes from the fact that logistic regression uses an extension of a linear regression model to model the probability of an input being in a particular class; see the sketch after this list.)
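A minimal sketch of that note: logistic regression passes a linear combination of the features through the logistic (sigmoid) function to model a class probability, here fitted by plain gradient descent on synthetic data invented for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)

# Logistic regression: p(y=1|x) = sigmoid(w.x + b), fitted by gradient
# descent on the log-loss (equivalently, by maximum likelihood).
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # weights roughly proportional to the true direction (1, 2)
```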
Nonparametric (no distributional assumption regarding the shape of feature distributions per class):
* Decision trees, decision lists
* Kernel estimation and K-nearest-neighbor algorithms
* Naive Bayes classifier
* Neural networks (multi-layer perceptrons)
* Perceptrons
* Support vector machines
* Gene expression programming
Clustering methods (methods for classifying and predicting categorical labels)
* Categorical mixture models
* Hierarchical clustering (agglomerative or divisive)
* K-means clustering
* Correlation clustering
* Kernel principal component analysis (Kernel PCA)
Ensemble learning algorithms (supervised meta-algorithms for combining multiple learning algorithms together)
* Boosting (meta-algorithm)
* Bootstrap aggregating ("bagging")
* Ensemble averaging
* Mixture of experts, hierarchical mixture of experts
General methods for predicting arbitrarily-structured (sets of) labels
* Bayesian networks
* Markov random fields
Multilinear subspace learning algorithms (predicting labels of multidimensional data using tensor representations)
Unsupervised:
* Multilinear principal component analysis (MPCA)
Real-valued sequence labeling methods (predicting sequences of real-valued labels)
* Kalman filters
* Particle filters
Regression methods (predicting real-valued labels)
* Gaussian process regression (kriging)
* Linear regression and extensions
* Independent component analysis (ICA)
* Principal components analysis (PCA)
Sequence labeling methods (predicting sequences of categorical labels)
* Conditional random fields (CRFs)
* Hidden Markov models (HMMs)
* Maximum entropy Markov models (MEMMs)
* Recurrent neural networks (RNNs)
* Dynamic time warping (DTW)
See also
* List of datasets for machine learning research
* List of numerical-analysis software
* List of numerical libraries
Further reading
* An introductory tutorial to classifiers (introducing the basic terms, with a numeric example)
External links
* The International Association for Pattern Recognition
* Journal of Pattern Recognition Research
* Pattern Recognition Info
* Pattern Recognition (Journal of the Pattern Recognition Society)
* International Journal of Pattern Recognition and Artificial Intelligence
* International Journal of Applied Pattern Recognition
* Open Pattern Recognition Project, intended to be an open source platform for sharing algorithms of pattern recognition
* Improved Fast Pattern Matching