Multi-task learning

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. Early versions of MTL were called "hints". In a widely cited 1997 paper, Rich Caruana gave the following characterization:
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features that distinguish spam emails from legitimate ones; for example, an English speaker may find that all emails in Russian are spam, but not so for Russian speakers. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance. Further examples of settings for MTL include multiclass classification and multi-label classification.

Multi-task learning works because the regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled. However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks. (Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M. (2012). Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf)
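The shared-representation idea in Caruana's characterization can be made concrete with a hard-parameter-sharing network: a common trunk feeds several task-specific heads, and the summed task losses train the trunk jointly. The sketch below is only illustrative; the layer sizes, task count, and uniform loss weighting are assumptions, not part of any reference implementation.

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Minimal hard-parameter-sharing model: one shared trunk, one head per task."""
    def __init__(self, in_dim=100, hidden=64, num_tasks=3, out_dim=2):
        super().__init__()
        # Shared representation learned jointly by all tasks.
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # One small classifier head per task.
        self.heads = nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(num_tasks)])

    def forward(self, x):
        z = self.trunk(x)
        return [head(z) for head in self.heads]

# Toy training step: the sum of per-task losses drives the shared trunk.
model = HardSharingMTL()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 100)                              # one batch of shared inputs
ys = [torch.randint(0, 2, (32,)) for _ in range(3)]   # labels for each task

outputs = model(x)
loss = sum(loss_fn(o, y) for o, y in zip(outputs, ys))
opt.zero_grad()
loss.backward()
opt.step()
```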


Methods


Task grouping and overlap

Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases. Task relatedness can be imposed a priori or learned from the data. Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly. (Hajiramezanali, E., Dadaneh, S. Z., Karbalayghareh, A., Zhou, Z., & Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.) For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.
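The basis-plus-sparsity picture can be illustrated numerically: if each task's weight vector is a sparse combination of shared dictionary atoms, tasks whose nonzero coefficients overlap form a group. The dictionary, coefficient values, and grouping below are made-up illustrations, not learned quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, T = 10, 4, 5            # feature dim, number of basis atoms, number of tasks
D = rng.normal(size=(d, k))   # shared basis (dictionary), one atom per column

# Sparse coefficients: column t gives the combination used by task t.
S = np.zeros((k, T))
S[[0, 1], 0] = [1.0, 0.5]     # tasks 0-2 use atoms {0, 1}  -> one group
S[[0, 1], 1] = [0.8, 0.7]
S[[0, 1], 2] = [0.2, 1.1]
S[[2, 3], 3] = [1.0, -0.4]    # tasks 3-4 use atoms {2, 3}  -> another group
S[[2, 3], 4] = [0.6, 0.9]

W = D @ S                     # column t is the parameter vector of task t

# Tasks whose coefficient supports overlap are "related" in this model.
supports = [set(np.flatnonzero(S[:, t])) for t in range(T)]
for t in range(T):
    for s in range(t + 1, T):
        if supports[t] & supports[s]:
            print(f"tasks {t} and {s} share basis atoms {supports[t] & supports[s]}")
```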


Exploiting unrelated tasks

One can attempt learning a group of principal tasks using a group of auxiliary tasks, unrelated to the principal ones. In many applications, joint learning of unrelated tasks that use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods that build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups that encourages the two representations to be orthogonal, as in the sketch below. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods. (Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M. (2012). Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf)
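One simple way to encourage the representations of the two task groups to be orthogonal is to penalize the Frobenius norm of the cross-product of their projection matrices. This is a generic sketch of that idea, not the exact formulation of Romera-Paredes et al.; the matrix shapes and the weight `mu` are assumptions.

```python
import torch

def orthogonality_penalty(U_main: torch.Tensor, U_aux: torch.Tensor) -> torch.Tensor:
    """||U_main^T U_aux||_F^2: zero exactly when the two column spaces are orthogonal."""
    return (U_main.T @ U_aux).pow(2).sum()

# U_main projects inputs to the principal tasks' shared representation,
# U_aux to the auxiliary (unrelated) tasks' representation.
d, r1, r2 = 20, 5, 3
U_main = torch.randn(d, r1, requires_grad=True)
U_aux = torch.randn(d, r2, requires_grad=True)

mu = 0.1                                       # assumed trade-off weight
penalty = mu * orthogonality_penalty(U_main, U_aux)
penalty.backward()   # in practice this term is added to the per-group task losses
```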


Transfer of knowledge

Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large-scale machine learning projects such as the deep convolutional neural network GoogLeNet, an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.
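Both re-uses of a pre-trained network mentioned above can be sketched in a few lines, assuming a torchvision-style pretrained GoogLeNet is available (torchvision >= 0.13 weight API); the number of new classes and the learning rate are placeholder values.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load a GoogLeNet pretrained on ImageNet.
model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)

# Use 1: frozen feature extractor -- keep all weights fixed and only
# read out the representation feeding the final classification layer.
for p in model.parameters():
    p.requires_grad = False

# Use 2: fine-tuning -- replace the final layer for a new task and continue
# training (here only the freshly created head has requires_grad=True).
num_new_classes = 10                                   # placeholder task size
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```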


Group online adaptive learning

Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed group online adaptive learning (GOAL). Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from the previous experience of another learner to quickly adapt to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time series, through content recommendation systems, to visual understanding for adaptive autonomous agents.


Mathematics


Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)

The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.


RKHSvv concepts

Suppose the training data set is $\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, with $x_i^t \in \mathcal{X}$, $y_i^t \in \mathcal{Y}$, where $t \in \{1, \dots, T\}$ indexes the task. Let $n = \sum_{t=1}^T n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L} : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}_+$ for each task. This results in the regularized machine learning problem

$$\min_{f \in \mathcal{H}} \; \sum_{t=1}^T \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\big(y_i^t, f_t(x_i^t)\big) + \lambda \, \|f\|_{\mathcal{H}}^2 \qquad (1)$$

where $\mathcal{H}$ is a vector-valued reproducing kernel Hilbert space with functions $f : \mathcal{X} \rightarrow \mathcal{Y}^T$ having components $f_t : \mathcal{X} \rightarrow \mathcal{Y}$.

The reproducing kernel for the space $\mathcal{H}$ of functions $f : \mathcal{X} \rightarrow \mathbb{R}^T$ is a symmetric matrix-valued function $\Gamma : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{T \times T}$, such that $\Gamma(\cdot, x)\,c \in \mathcal{H}$ and the following reproducing property holds:

$$\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)\,c \rangle_{\mathcal{H}} \qquad (2)$$

The reproducing kernel gives rise to a representer theorem showing that any solution to equation (1) has the form

$$f(x) = \sum_{i=1}^n \Gamma(x, x_i)\, c_i \qquad (3)$$


Separable kernels

The form of the kernel $\Gamma$ induces both the representation of the feature space and the structuring of the output across tasks. A natural simplification is to choose a ''separable kernel'', which factors into separate kernels on the input space $\mathcal{X}$ and on the tasks $\{1, \dots, T\}$. In this case the kernel relating scalar components $f_t$ and $f_s$ is given by $\gamma\big((x_i, t), (x_j, s)\big) = k(x_i, x_j)\, k_T(s, t) = k(x_i, x_j)\, A_{s,t}$. For vector-valued functions $f \in \mathcal{H}$ we can write $\Gamma(x_i, x_j) = k(x_i, x_j)\, A$, where $k$ is a scalar reproducing kernel and $A$ is a symmetric positive semi-definite $T \times T$ matrix. Henceforth denote the set of such matrices by $S_+^T \subset \mathbb{R}^{T \times T}$.

This factorization property, separability, implies that the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by $A$. Methods for non-separable kernels $\Gamma$ are a current field of research.

For the separable case, the representer theorem reduces to

$$f(x) = \sum_{i=1}^n k(x, x_i)\, A\, c_i \qquad (4)$$

The model output on the training data is then $KCA$, where $K$ is the $n \times n$ empirical kernel matrix with entries $K_{ij} = k(x_i, x_j)$, and $C$ is the $n \times T$ matrix whose rows are the $c_i$. With the separable kernel, equation (1) can be rewritten as

$$\min_{C \in \mathbb{R}^{n \times T}} \; V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) \qquad (P)$$

where $V$ is a (weighted) average of $\mathcal{L}$ applied entry-wise to $Y$ and $KCA$. (The weight is zero if $Y_i^t$ is a missing observation.) Note that the second term in $(P)$ can be derived as follows:

$$\begin{aligned} \|f\|_{\mathcal{H}}^2 &= \left\langle \sum_{i=1}^n k(\cdot, x_i) A c_i, \; \sum_{j=1}^n k(\cdot, x_j) A c_j \right\rangle_{\mathcal{H}} \\ &= \sum_{i,j=1}^n \big\langle k(\cdot, x_i) A c_i, \; k(\cdot, x_j) A c_j \big\rangle_{\mathcal{H}} && \text{(bilinearity)} \\ &= \sum_{i,j=1}^n \big\langle k(x_i, x_j) A c_i, \; c_j \big\rangle_{\mathbb{R}^T} && \text{(reproducing property)} \\ &= \sum_{i,j=1}^n k(x_i, x_j)\, c_i^\top A c_j = \operatorname{tr}(KCAC^\top) \end{aligned}$$
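The matrix form of the separable-kernel model is easy to check numerically: the sketch below builds a random Gram matrix $K$, coefficient matrix $C$, and PSD task matrix $A$, and verifies that $\operatorname{tr}(KCAC^\top)$ equals the double sum $\sum_{i,j} k(x_i,x_j)\, c_i^\top A c_j$ from the derivation above. All quantities are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 8, 3

X = rng.normal(size=(n, 2))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF Gram matrix
C = rng.normal(size=(n, T))                                         # rows are c_i
B = rng.normal(size=(T, T))
A = B @ B.T                                                          # symmetric PSD task matrix

outputs = K @ C @ A        # n x T matrix of model outputs on the training points

# Regularizer in matrix form ...
penalty_matrix = np.trace(K @ C @ A @ C.T)
# ... and as the explicit double sum from the derivation above.
penalty_sum = sum(K[i, j] * C[i] @ A @ C[j] for i in range(n) for j in range(n))

assert np.allclose(penalty_matrix, penalty_sum)
print(outputs.shape, penalty_matrix)
```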


Known task structure


Task structure representations

There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.


Task structure examples

Via the regularizer formulation, one can represent a variety of task structures easily (a numerical sketch of these constructions follows the list).

* Letting $A^\dagger = \gamma I_T + (\gamma - \lambda)\frac{1}{T}\mathbf{1}\mathbf{1}^\top$ (where $I_T$ is the $T \times T$ identity matrix, and $\mathbf{1}\mathbf{1}^\top$ is the $T \times T$ matrix of ones) is equivalent to letting $\gamma$ control the variance $\sum_t \|f_t - \bar f\|_{\mathcal{H}_k}$ of the tasks from their mean $\bar f = \frac{1}{T}\sum_t f_t$. For example, blood levels of some biomarker may be taken on $T$ patients at $n_t$ time points during the course of a day, and interest may lie in regularizing the variance of the predictions across patients.

* Letting $A^\dagger = \alpha I_T + (\alpha - \lambda) M$, where $M_{t,s} = \frac{1}{|G_r|}\,\mathbb{I}(t, s \in G_r)$, is equivalent to letting $\alpha$ control the variance measured with respect to a group mean: $\sum_r \sum_{t \in G_r} \big\| f_t - \frac{1}{|G_r|}\sum_{s \in G_r} f_s \big\|$. (Here $|G_r|$ is the cardinality of group $r$, and $\mathbb{I}$ is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.

* Letting $A^\dagger = \delta I_T + (\delta - \lambda) L$, where $L = D - M$ is the Laplacian for the graph with adjacency matrix $M$ giving pairwise similarities of tasks (and $D$ the corresponding diagonal degree matrix). This is equivalent to giving a larger penalty to the distance separating tasks $t$ and $s$ when they are more similar (according to the weight $M_{t,s}$), i.e. $\delta$ regularizes $\sum_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 M_{t,s}$.

* All of the above choices of $A$ also induce the additional regularization term $\lambda \sum_t \|f_t\|_{\mathcal{H}_k}^2$, which penalizes complexity in $f$ more broadly.
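As a concrete check of these constructions, the sketch below builds each $A^\dagger$ for a small number of tasks; the values of $\gamma$, $\alpha$, $\delta$, $\lambda$, the group assignment, and the similarity matrix are arbitrary placeholders.

```python
import numpy as np

T = 4
lam, gamma, alpha, delta = 0.1, 1.0, 1.0, 1.0
I = np.eye(T)
ones = np.ones((T, T))

# 1) Mean-regularized tasks: penalize variance of tasks around the global mean.
A_dag_mean = gamma * I + (gamma - lam) * ones / T

# 2) Group-regularized tasks: M[t, s] = 1/|G_r| if t and s share group r, else 0.
groups = [0, 0, 1, 1]                       # assumed assignment of the 4 tasks to 2 groups
M = np.zeros((T, T))
for r in set(groups):
    members = [t for t, g in enumerate(groups) if g == r]
    for t in members:
        for s in members:
            M[t, s] = 1.0 / len(members)
A_dag_group = alpha * I + (alpha - lam) * M

# 3) Graph-regularized tasks: Laplacian L = D - M of a task-similarity graph.
Sim = np.array([[0, 1, 0.5, 0],
                [1, 0, 0, 0.5],
                [0.5, 0, 0, 1],
                [0, 0.5, 1, 0]], dtype=float)
L = np.diag(Sim.sum(axis=1)) - Sim
A_dag_graph = delta * I + (delta - lam) * L

print(A_dag_mean, A_dag_group, A_dag_graph, sep="\n\n")
```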


Learning tasks together with their structure

The learning problem $(P)$ can be generalized to admit learning the task matrix $A$ as well:

$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} \; V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) + F(A) \qquad (Q)$$

The choice of $F : S_+^T \rightarrow \mathbb{R}_+$ must be designed to learn matrices $A$ of a given type. See "Special cases" below.


Optimization of Q

Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that although $(Q)$ is not convex jointly in $C$ and $A$, a related problem is jointly convex. Specifically, on the convex set $\mathcal{C} = \{(C, A) \in \mathbb{R}^{n \times T} \times S_+^T \mid \operatorname{Range}(C^\top K C) \subseteq \operatorname{Range}(A)\}$, the equivalent problem

$$\min_{(C, A) \in \mathcal{C}} \; V(Y, KC) + \lambda \operatorname{tr}(A^\dagger C^\top K C) + F(A) \qquad (R)$$

is convex with the same minimum value, and if $(C_R, A_R)$ is a minimizer for $(R)$ then $(C_R A_R^\dagger, A_R)$ is a minimizer for $(Q)$.

$(R)$ may be solved by a barrier method on a closed set by introducing the following perturbation:

$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} \; V(Y, KC) + \lambda \operatorname{tr}(A^\dagger C^\top K C) + \delta^2 \operatorname{tr}(A^\dagger) + F(A) \qquad (S)$$

The perturbation via the barrier $\delta^2 \operatorname{tr}(A^\dagger)$ forces the objective function to be equal to $+\infty$ on the boundary of $\mathbb{R}^{n \times T} \times S_+^T$. $(S)$ can be solved with a block coordinate descent method, alternating in $C$ and $A$. This results in a sequence of minimizers $(C_m, A_m)$ of $(S)$ that converges to the solution of $(R)$ as $\delta_m \rightarrow 0$, and hence gives the solution to $(Q)$.


Special cases

Spectral penalties - Dinuzzo et al. suggested setting $F$ as the Frobenius norm $\sqrt{\operatorname{tr}(A^\top A)}$. They optimized $(Q)$ directly using block coordinate descent, not accounting for difficulties at the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

Clustered tasks learning - Jacob et al. suggested learning $A$ in the setting where $T$ tasks are organized in $R$ disjoint clusters. In this case let $E \in \{0,1\}^{T \times R}$ be the matrix with $E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)$. Setting $M = E E^\dagger$ and $U = \frac{1}{T}\mathbf{1}\mathbf{1}^\top$, the task matrix $A^\dagger$ can be parameterized as a function of $M$: $A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon (I - M)$, with terms that penalize the average, the between-cluster variance, and the within-cluster variance of the task predictions, respectively. The set of admissible $M$ is not convex, but there is a convex relaxation $\mathcal{S}_c = \{S \in S_+^T : I - S \in S_+^T \wedge \operatorname{tr}(S) = R\}$. In this formulation, $F(A) = \mathbb{I}\big(A(M) \in \{A : M \in \mathcal{S}_c\}\big)$.
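The clustered parameterization is straightforward to construct; the sketch below builds $E$, $M = E E^\dagger$, $U$, and $A^\dagger(M)$ for an assumed assignment of five tasks to two clusters, with arbitrary values for the three $\epsilon$ weights.

```python
import numpy as np

T, R = 5, 2
groups = [0, 0, 1, 1, 1]           # assumed cluster assignment of the 5 tasks

# Assignment matrix E[t, r] = 1 if task t belongs to cluster r.
E = np.zeros((T, R))
for t, r in enumerate(groups):
    E[t, r] = 1.0

M = E @ np.linalg.pinv(E)          # E E^+ : projection averaging within clusters
U = np.ones((T, T)) / T            # projection onto the global mean

eps_mean, eps_between, eps_within = 1.0, 2.0, 4.0   # placeholder penalty weights
A_dagger = eps_mean * U + eps_between * (M - U) + eps_within * (np.eye(T) - M)

# Sanity checks: M is a projection and its trace equals the number of clusters.
assert np.allclose(M @ M, M)
assert np.isclose(np.trace(M), R)
print(A_dagger)
```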


Generalizations

Non-convex penalties - Penalties can be constructed such that $A$ is constrained to be a graph Laplacian, or such that $A$ has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.

Non-separable kernels - Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.


Applications


Spam filtering

Using the principles of MTL, techniques for collaborative spam filtering that facilitate personalization have been proposed. In large-scale, open-membership email systems, most users do not label enough messages for an individual local classifier to be effective, while the data is too noisy to be used for a global filter across all users. A hybrid global/individual classifier can be effective at absorbing the influence of users who label emails very diligently from the general public. This can be accomplished while still providing sufficient quality to users with few labeled instances.
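One simple way to realize such a hybrid is to blend a global model trained on all users' labels with a per-user model, weighting the personal model by how many labels that user has provided. This is a generic illustration of the idea rather than the method from the cited work; the feature representation, the blending rule, and the threshold of 20 labels are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hybrid_spam_scores(X_all, y_all, X_user, y_user, X_query):
    """Blend a global spam classifier with a per-user one.

    The weight on the personal model grows with the number of labels the
    user has provided (saturating at an assumed 20 labels).
    """
    global_clf = LogisticRegression(max_iter=1000).fit(X_all, y_all)
    p_global = global_clf.predict_proba(X_query)[:, 1]

    if len(y_user) == 0 or len(set(y_user)) < 2:
        return p_global                      # not enough personal data: global only

    user_clf = LogisticRegression(max_iter=1000).fit(X_user, y_user)
    p_user = user_clf.predict_proba(X_query)[:, 1]

    w = min(len(y_user) / 20.0, 1.0)         # confidence in the personal model
    return (1 - w) * p_global + w * p_user

# Toy usage with random "bag-of-words" style features.
rng = np.random.default_rng(0)
X_all, y_all = rng.random((200, 30)), rng.integers(0, 2, 200)
X_user, y_user = rng.random((8, 30)), rng.integers(0, 2, 8)
print(hybrid_spam_scores(X_all, y_all, X_user, y_user, rng.random((3, 30))))
```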


Web search

Using boosted decision trees, one can enable implicit data sharing and regularization. This learning method can be used on web-search ranking data sets. One example is to use ranking data sets from several countries. Here, multi-task learning is particularly helpful because data sets from different countries vary widely in size due to the cost of editorial judgments. It has been demonstrated that learning various tasks jointly can lead to significant improvements in performance with surprising reliability.
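A minimal way to get implicit data sharing with boosted trees is to pool the per-country ranking data and add the country as a feature, so small markets borrow statistical strength from large ones. This sketch only illustrates that pooling pattern with scikit-learn's gradient boosting; it is not the specific web-search system described above, and all data and sizes are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Per-country relevance-judgment datasets of very different sizes.
sizes = {"US": 5000, "SE": 300, "NZ": 120}
X_parts, y_parts = [], []
for idx, (country, n) in enumerate(sizes.items()):
    X = rng.random((n, 10))                               # query-document features
    y = X[:, 0] * 2 + 0.1 * idx + rng.normal(0, 0.1, n)   # toy relevance scores
    country_col = np.full((n, 1), idx, dtype=float)       # country as a feature
    X_parts.append(np.hstack([X, country_col]))
    y_parts.append(y)

# One model over the pooled data: trees may split on the country feature
# where countries differ, and share structure where they do not.
model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
model.fit(np.vstack(X_parts), np.concatenate(y_parts))
print(model.predict(np.vstack(X_parts)[:3]))
```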


Software package

The Multi-Task Learning via StructurAl Regularization (MALSAR) Matlab package implements the following multi-task learning algorithms:

* Mean-Regularized Multi-Task Learning
* Multi-Task Learning with Joint Feature Selection
* Robust Multi-Task Feature Learning
* Trace-Norm Regularized Multi-Task Learning
* Alternating Structural Optimization
* Incoherent Low-Rank and Sparse Learning
* Robust Low-Rank Multi-Task Learning
* Clustered Multi-Task Learning (Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems.)
* Multi-Task Learning with Graph Structures


See also

* Artificial intelligence
* Artificial neural network
* Automated machine learning (AutoML)
* Evolutionary computation
* General game playing
* Human-based genetic algorithm
* Kernel methods for vector output
* Multi-task optimization
* Robot learning
* Transfer learning




External links


The Biosignals Intelligence Group at UIUC



Software



* Online Multi-Task Learning Toolkit (OMT) (https://web.archive.org/web/20131224113826/http://klcl.pku.edu.cn/member/sunxu/code.htm): a general-purpose online multi-task learning toolkit based on conditional random field models and stochastic gradient descent training (C#, .NET)