Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods.
Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable (i.e., the response, or dependent variable) to be learned can be described by a reduced number of variables in the input space (i.e., the domain, space of features or explanatory variables). ''Sparsity regularization methods'' focus on selecting the input variables that best describe the output. ''Structured sparsity regularization methods'' generalize and extend sparsity regularization methods, by allowing for optimal selection over structures like groups or networks of input variables in <math>X</math>.
Common motivations for the use of structured sparsity methods are model interpretability, high-dimensional learning (where the dimensionality of <math>X</math> may be higher than the number of observations <math>n</math>), and reduction of computational complexity.
Moreover, structured sparsity methods allow the incorporation of prior assumptions on the structure of the input variables, such as overlapping groups, non-overlapping groups, and acyclic graphs.
Examples of uses of structured sparsity methods include face recognition,
magnetic resonance image (MRI) processing,
socio-linguistic analysis in natural language processing,
and analysis of genetic expression in breast cancer.
Definition and related concepts
Sparsity regularization
Consider the linear kernel regularized empirical risk minimization problem with a loss function <math>V</math> and the <math>\ell_0</math> "norm" as the regularization penalty:

:<math>\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_0 ,</math>

where <math>x_i \in \mathbb{R}^d</math>, and <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", defined as the number of nonzero entries of the vector <math>w</math>. The model <math>f(x) = \langle w, x \rangle</math> is said to be sparse if <math>\|w\|_0 = k \ll d</math>, which means that the output can be described by a small subset of the input variables.
More generally, assume a dictionary <math>\phi_j : X \to \mathbb{R}</math> with <math>j = 1, \dots, p</math> is given, such that the target function <math>f(x)</math> of a learning problem can be written as:

:<math>f(x) = \sum_{j=1}^p \phi_j(x) w_j , \quad \forall x \in X .</math>

The <math>\ell_0</math> norm <math>\|f\|_0</math>, defined as the number of non-zero components of <math>w</math>, is

:<math>\|f\|_0 = \left| \{ j \mid w_j \neq 0 ,\ j \in \{1, \dots, p\} \} \right| ,</math>

where <math>|A|</math> is the cardinality of the set <math>A</math>. <math>f</math> is said to be sparse if <math>\|f\|_0 = k < p</math>.
However, while using the <math>\ell_0</math> norm for regularization favors sparser solutions, it is computationally difficult to use and, in addition, is not convex. A computationally more feasible alternative that still favors sparse solutions is the <math>\ell_1</math> norm, which is convex.
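As a simple numerical illustration (a sketch with invented numbers, not part of the article), the two penalties can be computed directly with NumPy:

<syntaxhighlight lang="python">
import numpy as np

w = np.array([0.0, 1.5, 0.0, -0.3, 0.0])

# l0 "norm": number of nonzero entries (non-convex, combinatorial to optimize)
l0 = np.count_nonzero(w)   # -> 2

# l1 norm: sum of absolute values (convex surrogate that still promotes sparsity)
l1 = np.abs(w).sum()       # -> 1.8

print(l0, l1)
</syntaxhighlight>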
Structured sparsity regularization
Structured sparsity regularization extends and generalizes the variable selection problem that characterizes sparsity regularization.
Consider the above regularized empirical risk minimization problem with a general kernel and associated feature map <math>\Phi : X \to \mathbb{R}^p</math>, <math>\Phi(x) = (\phi_1(x), \dots, \phi_p(x))</math>, with <math>\phi_j : X \to \mathbb{R}</math>, <math>j = 1, \dots, p</math>:

:<math>\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, \Phi(x_i) \rangle) + \lambda \|w\|_0 .</math>

The regularization term <math>\lambda \|w\|_0</math> penalizes each component <math>w_j</math> independently, which means that the algorithm will suppress input variables independently from each other.
In several situations we may want to impose more structure in the regularization process, so that, for example, input variables are suppressed according to predefined groups. Structured sparsity regularization methods make it possible to impose such structure by adding structure to the norms defining the regularization term.
Structures and norms
Non-overlapping groups: group Lasso
The non-overlapping group case is the most basic instance of structured sparsity. In it, an ''a priori'' partition of the coefficient vector <math>w</math> into <math>G</math> non-overlapping groups is assumed. Let <math>w_g</math> be the vector of coefficients in group <math>g</math>; then a regularization term and its group norm can be defined as

:<math>R(w) = \lambda \|w\|_{\text{group}} = \lambda \sum_{g=1}^G \|w_g\|_2 ,</math>

where <math>\|w_g\|_2 = \sqrt{\textstyle\sum_{j=1}^{|G_g|} (w_g^j)^2}</math> is the group <math>\ell_2</math> norm, <math>G_g</math> is group <math>g</math>, and <math>w_g^j</math> is the ''j''-th component of group <math>G_g</math>.
The above norm is also referred to as group Lasso.
This regularizer will force entire coefficient groups towards zero, rather than individual coefficients. As the groups are non-overlapping, the set of non-zero coefficients can be obtained as the union of the groups that were not set to zero, and conversely for the set of zero coefficients.
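The following is a minimal sketch (function names and data are invented for illustration) of the group penalty and of its proximal operator, which is what sets whole groups to zero in practice:

<syntaxhighlight lang="python">
import numpy as np

def group_norm(w, groups):
    """Group-Lasso penalty: sum of the l2 norms of the coefficient groups.
    `groups` is a list of index arrays forming a partition of the coefficients."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def group_soft_threshold(w, groups, t):
    """Proximal operator of t * group_norm: shrinks each group's l2 norm by t
    and sets an entire group to zero when its norm falls below t."""
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > t:
            out[g] = (1.0 - t / norm) * w[g]
    return out

w = np.array([0.1, -0.2, 3.0, 2.5, 0.05])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(group_soft_threshold(w, groups, t=0.5))  # first and last groups vanish
</syntaxhighlight>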
Overlapping groups
Overlapping groups is the structured sparsity case in which a variable may belong to more than one group <math>g</math>. This case is often of interest because it can represent a more general class of relationships among variables than non-overlapping groups can, such as tree structures or other types of graphs.
There are two types of overlapping group sparsity regularization approaches, which are used to model different types of input variable relationships:
Intersection of complements: group Lasso
The ''intersection of complements'' approach is used in cases when we want to select only those input variables that have nonzero coefficients in all groups to which they belong. Consider again the group Lasso for a regularized empirical risk minimization problem:

:<math>\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, \Phi(x_i) \rangle) + \lambda \sum_{g=1}^G \|w_g\|_2 ,</math>

where <math>\|w_g\|_2</math> is the group <math>\ell_2</math> norm, <math>G_g</math> is group <math>g</math>, and <math>w_g^j</math> is the ''j''-th component of group <math>G_g</math>.
As in the non-overlapping case, the ''group Lasso'' regularizer will potentially set entire groups of coefficients to zero. Since groups may overlap, a variable is suppressed whenever any group containing it is set to zero, so the set of selected (nonzero) variables is the intersection of the complements of the groups that are set to zero.
This ''intersection of complements'' selection criterion implies the modeling choice that some coefficients within a particular group <math>g</math> may be set to zero, while others within the same group <math>g</math> may remain nonzero. In other words, coefficients within a group may differ depending on the several group memberships that each variable within the group may have.
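A toy sketch of this selection rule (group names and indices are invented for illustration): the zero pattern is the union of the zeroed groups, so the selected variables are the intersection of the complements of those groups.

<syntaxhighlight lang="python">
all_vars = set(range(6))
groups = {"g1": {0, 1, 2}, "g2": {2, 3}, "g3": {4, 5}}
zeroed = ["g1", "g3"]   # groups whose norm was shrunk to zero by the regularizer

# selected = intersection of the complements of the zeroed groups
selected = set.intersection(*(all_vars - groups[g] for g in zeroed))
print(sorted(selected))  # -> [3]; variable 2 is dropped because it also lies in g1
</syntaxhighlight>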
Union of groups: latent group Lasso
A different approach is to consider the union of groups for variable selection. This approach captures the modeling situation where variables can be selected as long as they belong to at least one group with nonzero coefficients. This modeling perspective implies that we want to preserve group structure.
The formulation of the union of groups approach is also referred to as latent group Lasso, and requires modifying the group <math>\ell_2</math> norm considered above and introducing the following regularizer:

:<math>R(w) = \inf \left\{ \sum_{g=1}^G \|w_g\|_2 \ : \ w = \sum_{g=1}^G \bar{w}_g \right\} ,</math>

where <math>w_g \in \mathbb{R}^{|G_g|}</math> is the vector of coefficients of group <math>g</math>, and <math>\bar{w}_g \in \mathbb{R}^p</math> is a vector with coefficients <math>(w_g)_j</math> for all variables <math>j</math> in group <math>g</math>, and <math>0</math> in all others, i.e., <math>(\bar{w}_g)_j = (w_g)_j</math> if <math>j</math> is in group <math>g</math> and <math>(\bar{w}_g)_j = 0</math> otherwise.
This regularizer can be interpreted as effectively replicating variables that belong to more than one group, therefore preserving group structure. As intended by the union of groups approach, requiring <math>w = \sum_{g=1}^G \bar{w}_g</math> produces a vector of weights <math>w</math> that effectively sums up the weights of all variables across all groups they belong to.
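A minimal sketch (helper names are invented, not from the article) of the replication idea: each overlapping group receives its own copy of its columns, an ordinary non-overlapping group Lasso is run on the expanded design, and the latent blocks are summed back to recover <math>w</math>.

<syntaxhighlight lang="python">
import numpy as np

def replicate_columns(X, groups):
    """Expand the design so each (possibly overlapping) group gets its own copy
    of its columns; returns the expanded matrix and the latent block indices."""
    X_rep = np.hstack([X[:, g] for g in groups])
    offsets = np.cumsum([0] + [len(g) for g in groups])
    blocks = [np.arange(offsets[i], offsets[i + 1]) for i in range(len(groups))]
    return X_rep, blocks

def recover_w(w_latent, blocks, groups, p):
    """Sum the latent per-group coefficient blocks back into a single w in R^p."""
    w = np.zeros(p)
    for b, g in zip(blocks, groups):
        w[g] += w_latent[b]
    return w
</syntaxhighlight>

A non-overlapping group Lasso solver (for example, proximal gradient with the block soft-thresholding step sketched earlier) can then be applied to the expanded design.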
Issues with Group Lasso regularization and alternative approaches
The objective function obtained with the group Lasso consists of an error function, which is generally required to be convex but not necessarily strongly convex, plus the group <math>\ell_2</math> regularization term. An issue with this objective function is that it is convex but not necessarily strongly convex, and thus generally does not lead to unique solutions.
One way to fix this is to introduce the squared <math>\ell_2</math> norm of the weight vector as an additional regularization term, while keeping the group regularization term from the group Lasso approach. If the coefficient of the squared <math>\ell_2</math> norm term is greater than <math>0</math>, then, because the squared <math>\ell_2</math> norm is strongly convex, the resulting objective function is also strongly convex.
Provided that this coefficient is suitably small but still positive, the minimizer of the resulting objective function is generally very close to a minimizer of the objective obtained by removing the squared <math>\ell_2</math> term altogether; the latter corresponds to the group Lasso approach.
Thus this approach allows for simpler optimization while maintaining sparsity.
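For concreteness, the modified objective described above can be written (with <math>\epsilon</math> denoting the coefficient of the added term, a symbol introduced here for illustration) as

:<math>\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, \Phi(x_i) \rangle) + \lambda \sum_{g=1}^G \|w_g\|_2 + \epsilon \|w\|_2^2 , \qquad \epsilon > 0 ,</math>

which is strongly convex for any <math>\epsilon > 0</math> and recovers the group Lasso objective as <math>\epsilon \to 0</math>.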
Norms based on the structure over input variables
''See: Submodular set function''
Besides the norms discussed above, other norms used in structured sparsity methods include hierarchical norms and norms defined on grids. These norms arise from submodular functions and allow the incorporation of prior assumptions on the structure of the input variables. In the context of hierarchical norms, this structure can be represented as a directed acyclic graph over the variables, while in the context of grid-based norms, the structure can be represented using a grid.
Hierarchical norms
''See: Unsupervised learning''
Unsupervised learning methods are often used to learn the parameters of latent variable models. Latent variable models are statistical models where, in addition to the observed variables, a set of latent variables also exists which is not observed. Often in such models, "hierarchies" are assumed between the variables of the system; this system of hierarchies can be represented using directed acyclic graphs.
Hierarchies of latent variables have emerged as a natural structure in several applications, notably to model text documents.
[Bengio, Y. "Learning deep architectures for AI". Foundations and Trends in Machine Learning, 2(1), 2009.] Hierarchical models using Bayesian non-parametric methods have been used to learn
topic models,
[Blei, D., Ng, A., and Jordan, M. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.] which are statistical models for discovering the abstract "topics" that occur in a collection of documents. Hierarchies have also been considered in the context of kernel methods.
Hierarchical norms have been applied to bioinformatics,
[S. Kim and E. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proc. ICML, 2010.] computer vision and topic models.
[R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proc. ICML, 2010.]
Norms defined on grids
If the structure assumed over variables is in the form of a 1D, 2D or 3D grid, then submodular functions based on overlapping groups can be considered as norms, leading to stable sets equal to rectangular or convex shapes.
Such methods have applications in computer vision.
[R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In ''Proc. AISTATS'', 2009.]
Algorithms for computation
Best subset selection problem
The problem of choosing the best subset of input variables can be naturally formulated under a penalization framework as:
[L. Rosasco. Lecture 10 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2014. Available at https://www.mit.edu/~9.520/fall14/slides/class18/class18_sparsity.pdf]
:<math>\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_0 ,</math>

where <math>\|w\|_0</math> denotes the <math>\ell_0</math> "norm", defined as the number of nonzero entries of the vector <math>w</math>.
Although this formulation makes sense from a modeling perspective, it is computationally infeasible, as it is equivalent to an exhaustive search evaluating all possible subsets of variables.
Two main approaches for solving the optimization problem are: 1) greedy methods, such as step-wise regression in statistics or matching pursuit in signal processing; and 2) convex relaxation formulation approaches and proximal gradient optimization methods.
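As an illustration of the greedy route (a sketch under the assumption of a dictionary with unit-norm columns; not taken from the cited sources), a basic matching pursuit loop looks as follows:

<syntaxhighlight lang="python">
import numpy as np

def matching_pursuit(D, y, k):
    """Greedy sparse approximation: repeatedly pick the dictionary atom most
    correlated with the current residual. D has unit-norm columns (atoms);
    returns a coefficient vector with at most k selected atoms."""
    w = np.zeros(D.shape[1])
    r = y.copy()
    for _ in range(k):
        corr = D.T @ r
        j = np.argmax(np.abs(corr))    # best-matching atom
        w[j] += corr[j]
        r -= corr[j] * D[:, j]         # remove its contribution from the residual
    return w
</syntaxhighlight>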
Convex relaxation
A natural approximation for the best subset selection problem is <math>\ell_1</math> norm regularization:

:<math>\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_1 .</math>

Such a scheme is called basis pursuit or the Lasso, which replaces the <math>\ell_0</math> "norm" with the convex, but non-differentiable, <math>\ell_1</math> norm.
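A short sketch (assuming scikit-learn is available; the data are synthetic and invented for illustration) showing that the <math>\ell_1</math> relaxation does produce sparse coefficient vectors:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.5, -2.0, 0.8]     # only three active variables
y = X @ w_true + 0.01 * rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)        # l1-regularized least squares
print(np.nonzero(model.coef_)[0])         # recovers (approximately) {2, 7, 11}
</syntaxhighlight>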
Proximal gradient methods
Proximal gradient methods, also called forward-backward splitting, are optimization methods useful for minimizing functions with a convex and differentiable component and a convex, potentially non-differentiable, component.
As such, proximal gradient methods are useful for solving sparsity and structured sparsity regularization problems of the following form:

:<math>\min_{w \in \mathbb{R}^d} F(w) + R(w) ,</math>

where <math>F(w)</math> is a convex and differentiable loss function like the quadratic loss, and <math>R(w)</math> is a convex, potentially non-differentiable, regularizer such as the <math>\ell_1</math> norm.
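A minimal sketch of the forward-backward iteration for the quadratic-loss, <math>\ell_1</math>-regularized case (the function names are introduced here for illustration):

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Forward-backward splitting for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1.
    Forward step: gradient descent on the smooth quadratic loss.
    Backward step: proximal map of the non-smooth l1 penalty."""
    n, d = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n          # gradient of the quadratic loss
        w = soft_threshold(w - step * grad, step * lam)
    return w
</syntaxhighlight>

Replacing the soft-thresholding step with the block soft-thresholding operator from the group Lasso section yields a proximal gradient method for structured sparsity.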
Connections to other areas of machine learning
Connection to multiple kernel learning
Structured sparsity regularization can be applied in the context of multiple kernel learning.
Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm.
In the algorithms mentioned above, a whole space was taken into consideration at once and was partitioned into groups, i.e. subspaces. A complementary point of view is to consider the case in which distinct spaces are combined to obtain a new one. It is useful to discuss this idea considering finite dictionaries. Finite dictionaries with linearly independent elements, also known as atoms, are finite sets of linearly independent basis functions whose linear combinations define hypothesis spaces. Finite dictionaries can be used to define specific kernels, as will be shown.
Assume for this example that rather than only one dictionary, several finite dictionaries are considered.
For simplicity, the case in which there are only two dictionaries <math>D_1 = \{\phi_j : j = 1, \dots, n_1\}</math> and <math>D_2 = \{\psi_j : j = 1, \dots, n_2\}</math>, where <math>n_1</math> and <math>n_2</math> are integers, will be considered. The atoms in <math>D_1</math> as well as the atoms in <math>D_2</math> are assumed to be linearly independent. Let <math>D = D_1 \cup D_2</math> be the union of the two dictionaries. Consider the linear space of functions <math>H</math> given by linear combinations of the form <math>f(x) = \sum_{j=1}^{n_1} \beta^1_j \phi_j(x) + \sum_{j=1}^{n_2} \beta^2_j \psi_j(x)</math> for some coefficient vectors <math>\beta^1 \in \mathbb{R}^{n_1}, \beta^2 \in \mathbb{R}^{n_2}</math>, where <math>\beta = (\beta^1, \beta^2)</math>. Assume the atoms in <math>D</math> to still be linearly independent, or equivalently, that the map <math>\beta \mapsto f</math> is one to one. The functions in the space <math>H</math> can be seen as the sums of two components, one in the space <math>H_1</math>, the linear combinations of atoms in <math>D_1</math>, and one in <math>H_2</math>, the linear combinations of the atoms in <math>D_2</math>.
One choice of norm on this space is <math>\|f\| = \|\beta^1\|_2 + \|\beta^2\|_2</math>. Note that we can now view <math>H</math> as a function space in which <math>H_1</math>, <math>H_2</math> are subspaces. In view of the linear independence assumption, <math>H_1</math> can be identified with <math>\mathbb{R}^{n_1}</math> and <math>H_2</math> with <math>\mathbb{R}^{n_2}</math>, respectively. The norm mentioned above can be seen as the group norm in <math>H</math> associated to the subspaces <math>H_1</math>, <math>H_2</math>, providing a connection to structured sparsity regularization.
Here, <math>H_1</math>, <math>H_2</math> and <math>H</math> can be seen to be reproducing kernel Hilbert spaces with corresponding feature maps <math>\Phi_1 : X \to \mathbb{R}^{n_1}</math>, given by <math>\Phi_1(x) = (\phi_1(x), \dots, \phi_{n_1}(x))</math>, <math>\Phi_2 : X \to \mathbb{R}^{n_2}</math>, given by <math>\Phi_2(x) = (\psi_1(x), \dots, \psi_{n_2}(x))</math>, and <math>\Phi : X \to \mathbb{R}^{n_1 + n_2}</math>, given by the concatenation of <math>\Phi_1, \Phi_2</math>, respectively.
In the structured sparsity regularization approach to this scenario, the relevant groups of variables that the group norms consider correspond to the subspaces <math>H_1</math> and <math>H_2</math>. This approach promotes setting the groups of coefficients corresponding to these subspaces to zero, as opposed to only individual coefficients, promoting sparse multiple kernel learning.
The above reasoning directly generalizes to any finite number of dictionaries, or feature maps. It can be extended to feature maps inducing infinite dimensional hypothesis
spaces.
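A toy sketch of this construction (the feature maps below are invented for illustration): two block feature maps are concatenated, and the group penalty is taken over the two coefficient blocks, so zeroing a block discards the corresponding kernel.

<syntaxhighlight lang="python">
import numpy as np

def phi1(x):
    """Feature map spanning H1 (here n1 = 2)."""
    return np.array([x, x ** 2])

def phi2(x):
    """Feature map spanning H2 (here n2 = 3)."""
    return np.array([np.sin(x), np.cos(x), np.exp(-x ** 2)])

def concat_features(X):
    """Feature map of H: the concatenation of the two block feature maps."""
    return np.array([np.concatenate([phi1(x), phi2(x)]) for x in X])

def mkl_group_penalty(beta, n1=2):
    """Group norm over the two coefficient blocks; zeroing a block drops a kernel."""
    return np.linalg.norm(beta[:n1]) + np.linalg.norm(beta[n1:])
</syntaxhighlight>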
When sparse multiple kernel learning is useful
Sparse multiple kernel learning is useful in several situations, including the following:
* Data fusion: when each kernel corresponds to a different kind of modality/feature.
* Nonlinear variable selection: consider kernels <math>K_j</math> depending only on one dimension of the input.
Generally, sparse multiple kernel learning is particularly useful when there are many kernels and model selection and interpretability are important.
Additional uses and applications
Structured sparsity regularization methods have been used in a number of settings where it is desired to impose an ''a priori'' input variable structure to the regularization process. Some such applications are:
* Compressive sensing in magnetic resonance imaging (MRI), reconstructing MR images from a small number of measurements, potentially yielding significant reductions in MR scanning time
* Robust face recognition in the presence of misalignment, occlusion and illumination variation
* Uncovering sociolinguistic associations between lexical frequencies used by Twitter authors and the socio-demographic variables of their geographic communities
* Gene selection analysis of breast cancer data using priors of overlapping groups, e.g., biologically meaningful gene sets
See also
* Statistical learning theory
* Regularization
* Sparse approximation
* Proximal gradient methods
* Convex analysis
* Feature selection
References
{{reflist}}
Machine learning
First order methods
Convex optimization