Weak supervision

Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide a supervision signal for labeling large amounts of training data in a supervised learning setting. This approach alleviates the burden of obtaining hand-labeled data sets, which can be costly or impractical. Instead, inexpensive weak labels are employed with the understanding that they are imperfect but can nonetheless be used to create a strong predictive model.


Problem of labeled training data

Machine learning models and techniques are increasingly accessible to researchers and developers; the real-world usefulness of these models, however, depends on access to high-quality labeled training data. This need for labeled training data often proves to be a significant obstacle to the application of machine learning models within an organization or industry. This bottleneck effect manifests itself in various ways, including the following examples:

Insufficient quantity of labeled data
When machine learning techniques are initially used in new applications or industries, there is often not enough training data available to apply traditional processes. Some industries have the benefit of decades' worth of training data readily available; those that do not are at a significant disadvantage. In such cases, obtaining training data may be impractical, expensive, or impossible without waiting years for its accumulation.

Insufficient subject-matter expertise to label data
When labeling training data requires specific relevant expertise, creation of a usable training data set can quickly become prohibitively expensive. This issue is likely to occur, for example, in biomedical or security-related applications of machine learning.

Insufficient time to label and prepare data
Most of the time required to implement machine learning is spent in preparing data sets. When an industry or research field deals with problems that are, by nature, rapidly evolving, it can be impossible to collect and prepare data quickly enough for results to be useful in real-world applications. This issue could occur, for example, in fraud detection or cybersecurity applications.

Other areas of machine learning exist that are likewise motivated by the demand for increased quantity and quality of labeled training data but employ different high-level techniques to approach this demand. These other approaches include active learning, semi-supervised learning, and transfer learning.


Types of weak labels

Weak labels are intended to decrease the cost and increase the efficiency of human efforts expended in hand-labeling data. They can take many forms, and might be categorized into three types:
* Global statistics on groups of inputs: This setting consists of access to global information on bags of samples, e.g. knowing that half of the labels of a given subset of samples are positive. Examples of global-statistics supervision include multiple-instance learning and learning from label proportions; a minimal sketch of the latter follows this list.
* Weak classifiers: A second approach assumes access to many weak classifiers that correlate weakly with the function to learn. These classifiers might model labelers from a crowdsourcing platform, experts, noisy measurements, or heuristic rules. More generally, developers may take advantage of existing resources (such as knowledge bases, alternative data sets, or pre-trained models) to create labels that are helpful, though not perfectly suited to the given task.
* Incomplete annotation: Finally, weak supervision might be understood as access to partial knowledge of each label. This partial knowledge can be thought of as a corruption process. In some instances, a partial observation can be cast as a set of potential labels compatible with it, which is the setting of partial supervision. Partial supervision is a generalization of semi-supervised learning, which has been the classical approach to overcoming the bottleneck of data annotation.
Beyond those three settings, the limitations that motivate weakly supervised learning might be tackled by leveraging human knowledge in the form of priors or function architectures, reviving older approaches of artificial intelligence such as inductive logic programming.
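As an illustration of the first setting, a model can be trained so that its average prediction over each bag matches the bag's known label proportion. The following is a minimal sketch of such a label-proportion loss in Python; the bag structure and the squared-error matching criterion are illustrative assumptions, not a reference implementation.

import numpy as np

def proportion_loss(pred_probs, bags, bag_proportions):
    # pred_probs: (n,) predicted P(y=1) per sample
    # bags: list of index arrays, one per bag
    # bag_proportions: known fraction of positives per bag
    loss = 0.0
    for idx, p in zip(bags, bag_proportions):
        # Penalize mismatch between the bag's mean prediction
        # and its known proportion of positive labels.
        loss += (pred_probs[idx].mean() - p) ** 2
    return loss / len(bags)

# Toy example: two bags; half of bag 0 and all of bag 1 are positive.
probs = np.array([0.9, 0.2, 0.4, 0.8, 0.7, 0.95])
bags = [np.array([0, 1, 2]), np.array([3, 4, 5])]
print(proportion_loss(probs, bags, np.array([0.5, 1.0])))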


Applications of weak supervision

Applications of weak supervision are numerous and varied within the machine learning research community.

In 2014, researchers from UC Berkeley made use of the principles of weak supervision to propose an iterative learning algorithm that depends solely on labels generated by heuristics, alleviating the need to collect any ground-truth labels. The algorithm was applied to smart-meter data to learn about household occupancy without ever asking for occupancy data, which raised issues of privacy and security, as covered by an article in IEEE Spectrum.

Researchers from the University of Southern California showed in 2017 that a very deep neural network can be trained to estimate a 3D face shape from a single image using weakly supervised learning. The researchers noted the challenge of obtaining large amounts of in-the-wild face photos with accompanying ground-truth 3D face shapes. Instead of collecting this data, they proposed automatically generating labels for faces in an existing face dataset: they used a classical, pre-deep-learning method to estimate (noisy) 3D face shapes for the face images, then average-pooled the multiple 3D estimates obtained for different photos of the same person. These pooled estimates, generated for photos of different people, were used as weak supervision when training a deep network to regress 3D face shapes from single input face images. Importantly, their analysis showed that the network's 3D predictions were more accurate than the estimates produced by the method used to generate the proxy training labels, demonstrating the effectiveness of this approach. The same team later successfully applied similar weakly supervised methods, employing pre-deep-learning methods to automatically generate proxy labels for training networks to estimate six-degrees-of-freedom (6DoF) head poses and 3D facial deformations.

In 2018, researchers from UC Riverside proposed a method to localize actions/events in videos using only weak supervision, i.e., video-level labels, without any information about the start and end times of the events during training. Their work introduced an attention-based similarity between two videos, which acts as a regularizer for learning with weak labels. In 2019, they introduced the new problem of event localization in videos using text queries from users, again with only weak annotations during training. Later, in a collaboration with NEC Laboratories America, a similar attention-based alignment mechanism with weak labels was introduced for adapting a source semantic-segmentation model to a target domain. When the weak labels of the target images are estimated using the source model, the method is unsupervised domain adaptation, requiring no target annotation cost; when the weak labels are acquired from an annotator, it incurs a very small annotation cost and falls under weakly supervised domain adaptation, which was first introduced in this work for semantic segmentation.
Stanford University researchers created Snorkel, an open-source system for quickly assembling training data through weak supervision. Snorkel employs the central principles of the data programming paradigm, in which developers write labeling functions that are used to programmatically label data, and the system applies supervised learning techniques to assess the accuracy of those labeling functions. In this way, potentially low-quality inputs can be used to create high-quality models. The Stanford AI Lab researchers subsequently created Snorkel AI, which originated from the Snorkel project and uses programmatic data labeling and weak supervision approaches to significantly decrease AI development cost and time. In joint work with Google, Stanford researchers showed that existing organizational knowledge resources can be converted into weak supervision sources and used to significantly decrease development cost and time.
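A minimal sketch of the data programming workflow, written against the labeling-function API documented for Snorkel 0.9 (module paths and signatures are assumptions that may differ across versions): developers write small heuristic functions that vote or abstain on each example, and a label model combines their noisy votes into probabilistic training labels.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages with URLs tend to be spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_very_short(x):
    # Heuristic: very short messages tend to be ham.
    return HAM if len(x.text.split()) < 4 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "win money now http://spam.example", "thanks, see you tomorrow", "ok",
]})

# Apply all labeling functions to get an (n_examples, n_lfs) vote matrix.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_very_short])
L_train = applier.apply(df=df_train)

# Learn labeling-function accuracies and output probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
probs = label_model.predict_proba(L_train)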
In 2019, Massachusetts Institute of Technology and Google researchers released cleanlab, the first standardized Python package for machine learning and deep learning with noisy labels. Cleanlab implements confident learning, a framework of theory and algorithms for dealing with uncertainty in dataset labels, to (1) find label errors in datasets, (2) characterize label noise, and (3) standardize and simplify research in weak supervision and learning with noisy labels.
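The core estimator in confident learning can be sketched briefly: each class gets a threshold equal to the mean predicted probability of that class among examples carrying that label, and an example is flagged when some other class clears its threshold. This simplified NumPy sketch follows the idea of the confident learning paper rather than the cleanlab API, and omits the joint noise-matrix calibration and pruning steps of the full framework.

import numpy as np

def flag_label_issues(labels, pred_probs):
    # labels: (n,) given, possibly noisy, integer labels
    # pred_probs: (n, k) out-of-sample predicted class probabilities
    n, k = pred_probs.shape
    # t_j: expected self-confidence of class j on examples labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    issues = np.zeros(n, dtype=bool)
    for i in range(n):
        # Classes whose predicted probability clears their own threshold.
        confident = np.flatnonzero(pred_probs[i] >= thresholds)
        # Suspect label: some class is confidently predicted, but not the given one.
        issues[i] = confident.size > 0 and labels[i] not in confident
    return issues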
Researchers at the University of Massachusetts Amherst propose augmenting traditional active learning approaches by soliciting labels on features rather than on instances within a data set. Researchers at Johns Hopkins University propose reducing the cost of labeling data sets by having annotators provide rationales supporting each of their data annotations, then using those rationales to train both discriminative and generative models for labeling additional data. Researchers at the University of Alberta propose a method that applies traditional active learning approaches to enhance the quality of the imperfect labels provided by weak supervision.


Semi-supervised learning

Semi-supervised learning is a special instance of weak supervision that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). Unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.

The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.

A set of l independently identically distributed examples x_1,\dots,x_l \in X with corresponding labels y_1,\dots,y_l \in Y and u unlabeled examples x_{l+1},\dots,x_{l+u} \in X are processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.

Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1},\dots,x_{l+u} only. The goal of inductive learning is to infer the correct mapping from X to Y. Intuitively, the learning problem can be seen as an exam and the labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
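As a concrete transductive example, scikit-learn's semi-supervised estimators take a label vector in which unlabeled points are marked with -1 and then infer labels for exactly those points. A brief sketch on a toy dataset (the dataset and the choice of LabelPropagation are illustrative):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# Keep only 10 labels; mark all other points as unlabeled (-1).
rng = np.random.default_rng(0)
y = np.full(len(y_true), -1)
labeled = rng.choice(len(y_true), size=10, replace=False)
y[labeled] = y_true[labeled]

model = LabelPropagation().fit(X, y)
print((model.transduction_ == y_true).mean())  # transductive accuracy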


Assumptions

In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions:


Continuity / smoothness assumption

''Points that are close to each other are more likely to share a label.'' This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that few points are close to each other but in different classes.


Cluster assumption

''The data tend to form discrete clusters, and points in the same cluster are more likely to share a label'' (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.


Manifold assumption

''The data lie approximately on a manifold of much lower dimension than the input space.'' In this case learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold. The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds, and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem rather than in the space of all possible acoustic waves or images, respectively.


History

The heuristic approach of ''self-training'' (also known as ''self-learning'' or ''self-labeling'') is historically the oldest approach to semi-supervised learning, with examples of applications starting in the 1960s. The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s. Interest in inductive learning using generative models also began in the 1970s. A ''probably approximately correct'' learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995. Semi-supervised learning has recently become more popular and practically relevant due to the variety of problems for which vast quantities of unlabeled data are available, e.g. text on websites, protein sequences, or images.


Methods


Generative models

Generative approaches to statistical learning first seek to estimate p(x|y), the distribution of data points belonging to each class. The probability p(y|x) that a given point x has label y is then proportional to p(x|y)p(y) by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about p(x)) or as an extension of unsupervised learning (clustering plus some labels).

Generative models assume that the distributions take some particular form p(x|y,\theta) parameterized by the vector \theta. If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone. However, if the assumptions are correct, then the unlabeled data necessarily improves performance.

The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models. The parameterized joint distribution can be written as p(x,y|\theta)=p(y|\theta)p(x|y,\theta) by using the chain rule. Each parameter vector \theta is associated with a decision function f_\theta(x) = \underset{y}{\operatorname{argmax}}\ p(y|x,\theta). The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by \lambda:

:\underset{\Theta}{\operatorname{argmax}}\left( \log p(\{x_i,y_i\}_{i=1}^l \mid \theta) + \lambda \log p(\{x_i\}_{i=l+1}^{l+u} \mid \theta)\right)

Zhu, Xiaojin. ''Semi-Supervised Learning''. University of Wisconsin-Madison.
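For a two-class mixture of spherical Gaussians (an illustrative modeling assumption), this fit can be computed with expectation-maximization: responsibilities of labeled points are clamped to their observed labels, while those of unlabeled points are re-estimated at each iteration. A minimal sketch:

import numpy as np
from scipy.stats import multivariate_normal

def em_semisupervised(Xl, yl, Xu, n_iter=50):
    X = np.vstack([Xl, Xu])
    resp = np.zeros((len(X), 2))
    resp[np.arange(len(yl)), yl] = 1.0   # clamp labeled responsibilities
    resp[len(yl):] = 0.5                 # uninformative start for unlabeled
    for _ in range(n_iter):
        # M-step: class priors and means from current responsibilities.
        pi = resp.mean(axis=0)
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
        # E-step: update responsibilities of the unlabeled points only
        # (identity covariance is assumed for simplicity).
        like = np.column_stack([
            pi[k] * multivariate_normal.pdf(Xu, mean=mu[k]) for k in range(2)
        ])
        resp[len(yl):] = like / like.sum(axis=1, keepdims=True)
    return pi, mu, resp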


Low-density separation

Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss (1-yf(x))_+ for labeled data, a loss function (1-|f(x)|)_+ is introduced over the unlabeled data by letting y=\operatorname{sgn}{f(x)}. TSVM then selects f^*(x) = h^*(x) + b from a reproducing kernel Hilbert space \mathcal{H} by minimizing the regularized empirical risk:

:f^* = \underset{f}{\operatorname{argmin}}\left( \sum_{i=1}^l (1-y_i f(x_i))_+ + \lambda_1 \|h\|_\mathcal{H}^2 + \lambda_2 \sum_{i=l+1}^{l+u} (1-|f(x_i)|)_+ \right)

An exact solution is intractable due to the non-convex term (1-|f(x)|)_+, so research focuses on useful approximations. Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).
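The terms of this objective are straightforward to write down even though minimizing them is not; the following sketch evaluates the TSVM risk for a linear scorer f(x) = x @ w + b, an illustrative simplification of the kernelized formulation. The non-convex hat loss on unlabeled points is what makes exact optimization intractable, so practical solvers anneal \lambda_2 upward or apply convex-concave procedures.

import numpy as np

def tsvm_objective(w, b, Xl, yl, Xu, lam1=1e-2, lam2=1e-1):
    # Hinge loss on labeled data (yl in {-1, +1}).
    hinge = np.maximum(0.0, 1.0 - yl * (Xl @ w + b)).sum()
    # Non-convex "hat" loss pushing unlabeled points away from the boundary.
    hat = np.maximum(0.0, 1.0 - np.abs(Xu @ w + b)).sum()
    return hinge + lam1 * (w @ w) + lam2 * hat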


Laplacian regularization

Laplacian regularization has historically been approached through the graph Laplacian. Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its k nearest neighbors or to all examples within some distance \epsilon. The weight W_{ij} of an edge between x_i and x_j is then set to e^{-\|x_i - x_j\|^2 / \epsilon}.

Within the framework of manifold regularization, the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes

:\underset{f\in\mathcal{H}}{\operatorname{argmin}}\left( \frac{1}{l}\sum_{i=1}^l V(f(x_i),y_i) + \lambda_A \|f\|^2_\mathcal{H} + \lambda_I \int_\mathcal{M} \|\nabla_\mathcal{M} f(x)\|^2 \, dp(x) \right)

where \mathcal{H} is a reproducing kernel Hilbert space and \mathcal{M} is the manifold on which the data lie. The regularization parameters \lambda_A and \lambda_I control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian L = D - W, where D_{ii} = \sum_{j=1}^{l+u} W_{ij} and \mathbf{f} is the vector [f(x_1) \dots f(x_{l+u})], we have

:\mathbf{f}^T L \mathbf{f} = \frac{1}{2} \sum_{i,j=1}^{l+u} W_{ij}(f_i-f_j)^2 \approx \int_\mathcal{M} \|\nabla_\mathcal{M} f(x)\|^2 \, dp(x).

The graph-based approach to Laplacian regularization can be related to the finite difference method. The Laplacian can also be used to extend supervised learning algorithms such as regularized least squares and support vector machines (SVM) to semi-supervised versions, Laplacian regularized least squares and Laplacian SVM.
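A small sketch of this graph machinery: Gaussian edge weights, the Laplacian L = D - W, and the harmonic solution that clamps f on labeled nodes and minimizes \mathbf{f}^T L \mathbf{f} over the unlabeled ones (a hard-clamped special case of the objective above; the fully connected graph and dense solve are illustrative simplifications).

import numpy as np

def harmonic_label_spread(X, y, labeled_mask, eps=1.0):
    # Gaussian similarity weights on all pairs, no self-loops.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    u, l = ~labeled_mask, labeled_mask
    f = y.astype(float).copy()
    # First-order optimality of f^T L f with f_l fixed: L_uu f_u = -L_ul f_l.
    f[u] = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, l)] @ f[l])
    return f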


Heuristic approaches

Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples x_1,\dots,x_{l+u} may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples. In this vein, some methods learn a low-dimensional representation using the supervised data and then apply either low-density separation or graph-based methods to the learned representation. Iteratively refining the representation and then performing semi-supervised learning on that representation may further improve performance.

''Self-training'' is a wrapper method for semi-supervised learning. First a supervised learning algorithm is trained on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm. Generally only the labels in which the classifier is most confident are added at each step. Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.
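Self-training as described is available off the shelf; for instance, scikit-learn's SelfTrainingClassifier wraps any probabilistic classifier, marking unlabeled points with -1 and adding pseudo-labels whose predicted confidence clears a threshold at each round. A brief sketch (the dataset and threshold are illustrative):

from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y_true = make_classification(n_samples=300, random_state=0)
y = y_true.copy()
y[30:] = -1  # keep 30 labels, treat the rest as unlabeled

# The base classifier must expose predict_proba; only pseudo-labels
# with confidence >= threshold are added at each self-training round.
clf = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
clf.fit(X, y)
print((clf.predict(X) == y_true).mean())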


In human cognition

Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data. More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback). Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces. Infants and children take into account not only unlabeled examples, but the sampling process from which labeled examples arise.


See also

* PU learning




External links


* Manifold Regularization: a freely available MATLAB implementation of the graph-based semi-supervised algorithms Laplacian support vector machines and Laplacian regularized least squares.
* KEEL: a software tool to assess evolutionary algorithms for data mining problems (regression, classification, clustering, pattern mining and so on), including a module for semi-supervised learning.
* Semi-Supervised Learning Software
* Semi-supervised learning in scikit-learn.