Structured prediction

Structured prediction or structured output learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than discrete or real values. As in other supervised learning settings, structured prediction models are typically trained on observed data: the predicted value is compared to the ground truth, and the discrepancy is used to adjust the model parameters. Because of the complexity of the model and the interrelations of the predicted variables, exact model training and inference are often computationally infeasible, so approximate inference and learning methods are used.


Applications

An example application is the problem of translating a natural language sentence into a syntactic representation such as a parse tree. This can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees. Structured prediction is used in a wide variety of domains including bioinformatics, natural language processing (NLP), speech recognition, and computer vision.


Example: sequence tagging

Sequence tagging is a class of problems prevalent in NLP in which input data are often sequential, for instance sentences of text. The sequence tagging problem appears in several guises, such as part-of-speech tagging (POS tagging) and named entity recognition. In POS tagging, for example, each word in a sequence must be 'tagged' with a ''class label'' representing the type of word:

:This/DT is/VBZ a/DT tagged/JJ sentence/NN.

The main challenge of this problem is to resolve ambiguity: in the above example, the words "sentence" and "tagged" in English can also be verbs. While this problem can be solved by simply performing classification of individual tokens, that approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong conditional dependence on the tag of the previous word. This fact can be exploited in a sequence model such as a hidden Markov model or conditional random field that predicts the entire tag sequence for a sentence (rather than just individual tags) via the Viterbi algorithm.
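
To make the decoding step concrete, the following is a minimal sketch (in Python) of the Viterbi algorithm for a first-order HMM tagger. The transition, emission, and start scores here are hypothetical toy log-probabilities chosen purely for illustration, not estimates from any corpus.

import math

def viterbi(words, tags, trans, emit, start):
    """Most likely tag sequence for `words` under a first-order HMM.

    trans[(t1, t2)]: log-probability of tag t2 following tag t1
    emit[(t, w)]:    log-probability of tag t emitting word w
    start[t]:        log-probability of a sentence starting with tag t
    """
    # best[i][t] = score of the best tagging of words[:i+1] that ends in tag t
    best = [{t: start.get(t, -math.inf) + emit.get((t, words[0]), -math.inf)
             for t in tags}]
    back = []  # back[i][t] = previous tag on that best path
    for i in range(1, len(words)):
        scores, pointers = {}, {}
        for t in tags:
            # pick the best previous tag for the current tag t
            p = max(tags, key=lambda q: best[-1][q] + trans.get((q, t), -math.inf))
            scores[t] = (best[-1][p] + trans.get((p, t), -math.inf)
                         + emit.get((t, words[i]), -math.inf))
            pointers[t] = p
        best.append(scores)
        back.append(pointers)
    # follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Hypothetical toy model: made-up log-probability scores.
tags = ["DT", "NN", "VBZ"]
trans = {("DT", "VBZ"): -0.5, ("VBZ", "DT"): -0.6, ("DT", "NN"): -0.7}
emit = {("DT", "this"): -0.3, ("VBZ", "is"): -0.2,
        ("DT", "a"): -0.4, ("NN", "sentence"): -0.5}
start = {"DT": -0.2}
print(viterbi(["this", "is", "a", "sentence"], tags, trans, emit, start))
# -> ['DT', 'VBZ', 'DT', 'NN']

The dynamic program considers every tag for every position but only the best path into each (position, tag) pair, so decoding takes time linear in sentence length rather than exponential.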


Techniques

Probabilistic graphical models form a large class of structured prediction models. In particular, Bayesian networks and random fields are popular. Other algorithms and models for structured prediction include inductive logic programming, case-based reasoning, structured SVMs, Markov logic networks, probabilistic soft logic, and constrained conditional models. The main techniques are:

* Conditional random fields
* Structured support vector machines
* Structured ''k''-nearest neighbours
* Recurrent neural networks, in particular Elman networks
* Transformers


Structured perceptron

One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron by Collins. This algorithm combines the perceptron algorithm for learning linear classifiers with an inference algorithm (classically the Viterbi algorithm when used on sequence data) and can be described abstractly as follows:

# First, define a function \phi(x, y) that maps a training sample x and a candidate prediction y to a vector of length n (x and y may have any structure; n is problem-dependent, but must be fixed for each model). Let GEN(x) be a function that generates candidate predictions for x.
# Then:
:: Let w be a weight vector of length n
:: For a predetermined number of iterations:
::: For each sample x in the training set with true output t:
:::: Make a prediction \hat{y} = \arg\max_{y \in \text{GEN}(x)} w^\mathsf{T} \phi(x, y)
:::: Update w, moving from \hat{y} towards t: w = w + c\,(\phi(x, t) - \phi(x, \hat{y})), where c is the learning rate

In practice, finding the argmax over GEN(x) is done using an algorithm such as Viterbi or max-sum, rather than an exhaustive search through an exponentially large set of candidates. The idea of learning is similar to that of the multiclass perceptron.


References

* Noah Smith, ''Linguistic Structure Prediction'', Synthesis Lectures on Human Language Technologies, Morgan & Claypool, 2011.
* Michael Collins, "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms", Proc. EMNLP, 2002.


External links


Implementation of Collins structured perceptron