Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods.[1] Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable $Y$ (i.e., the response, or dependent variable) to be learned can be described by a reduced number of variables in the input space $X$ (i.e., the domain, space of features, or explanatory variables). Sparsity regularization methods focus on selecting the input variables that best describe the output. Structured sparsity regularization methods generalize and extend sparsity regularization methods by allowing for optimal selection over structures such as groups or networks of input variables in $X$.

Common motivations for the use of structured sparsity methods are model interpretability, high-dimensional learning (where the dimensionality of $X$ may be higher than the number of observations $n$), and reduction of computational complexity.
Consider the linear kernel regularized empirical risk minimization problem with a loss function $V(y_i, f(x))$ and the $\ell_0$ "norm" as the regularization penalty:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_0,$$

where $x_i, w \in \mathbb{R}^d$, and $\|w\|_0$ denotes the $\ell_0$ "norm", defined as the number of nonzero entries of the vector $w$. The model $f(x) = \langle w, x \rangle$ is said to be sparse if $\|w\|_0 = s < d$, meaning that the output $Y$ can be described by a small subset of the input variables.
More generally, assume a dictionary $\phi_j : X \to \mathbb{R}$, with $j = 1, \dots, p$, is given, such that the target function $f(x)$ of a learning problem can be written as

$$f(x) = \sum_{j=1}^p \phi_j(x)\, w_j, \quad \forall x \in X.$$

The $\ell_0$ norm $\|f\|_0 = \|w\|_0$, i.e., the number of non-zero components of $w$, is defined as

$$\|w\|_0 = |\{ j \mid w_j \neq 0,\ j \in \{1, \dots, p\} \}|,$$

where $|A|$ is the cardinality of the set $A$. The function $f$ is said to be sparse if $\|f\|_0 = \|w\|_0 = s < p$.
However, while using the $\ell_0$ norm for regularization favors sparser solutions, it is computationally difficult to use and, in addition, it is not convex. A computationally more feasible choice that still favors sparse solutions, and is additionally convex, is the $\ell_1$ norm.
Structured sparsity regularization extends and generalizes the variable selection problem that characterizes sparsity regularization.[6] Consider the above regularized empirical risk minimization problem with a general kernel and associated feature map $\Phi(x) = (\phi_1(x), \dots, \phi_p(x))$, with $\phi_j : X \to \mathbb{R}$ for $j = 1, \dots, p$:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(y_i, \langle w, \Phi(x_i) \rangle) + \lambda \|w\|_0.$$

The regularization term $\lambda \|w\|_0$ penalizes each component $w_j$ individually, which means that the algorithm will suppress input variables independently of one another.
In several situations we may want to impose more structure on the regularization process, so that, for example, input variables are suppressed according to predefined groups. Structured sparsity regularization methods allow such structure to be imposed by adding structure to the norms that define the regularization term.
The non-overlapping group case is the most basic instance of structured sparsity. In it, an a priori partition of the coefficient vector $w$ into $G$ non-overlapping groups is assumed. Let $w_g$ be the vector of coefficients of group $g$; then the regularization term is

$$\lambda R(w) = \lambda \sum_{g=1}^G \|w_g\|_g,$$

where $\|w_g\|_g$ is the group $\ell_2$ norm

$$\|w_g\|_g = \sqrt{\sum_{j=1}^{|G_g|} (w_g^j)^2},$$

where $G_g$ is group $g$ and $w_g^j$ is the $j$-th component of group $G_g$.
The above norm is also referred to as group Lasso. This regularizer will force entire coefficient groups towards zero, rather than individual coefficients. As the groups are non-overlapping, the set of non-zero coefficients can be obtained as the union of the groups that were not set to zero, and conversely for the set of zero coefficients.
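As an illustration (not part of the original formulation), the following NumPy sketch computes the non-overlapping group norm above and its proximal operator, which is block soft-thresholding; the group partition and threshold in the example are hypothetical.

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """Sum of the Euclidean norms of the coefficient blocks (non-overlapping group Lasso)."""
    return sum(np.linalg.norm(w[idx]) for idx in groups)

def prox_group_lasso(v, groups, t):
    """Proximal operator of t * sum_g ||w_g||_2 for a non-overlapping partition.

    Each block is shrunk toward zero and is set exactly to zero when its norm
    falls below the threshold t (block soft-thresholding).
    """
    w = np.array(v, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        w[idx] = 0.0 if norm <= t else (1.0 - t / norm) * w[idx]
    return w

# Example: a 6-dimensional coefficient vector partitioned into 3 groups of 2.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
v = np.array([0.1, -0.2, 2.0, 1.5, 0.05, 0.02])
print(group_lasso_penalty(v, groups))
print(prox_group_lasso(v, groups, t=0.5))  # the first and last blocks are zeroed out
```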
Overlapping groups is the structured sparsity case in which a variable can belong to more than one group $g$. There are two types of overlapping group sparsity regularization approaches, which are used to model different types of input variable relationships:
The intersection of complements approach is used in cases when we want to select only those input variables that have positive coefficients in all groups they belong to. Consider again the group Lasso regularizer for a regularized empirical risk minimization problem:

$$\lambda R(w) = \lambda \sum_{g=1}^G \|w_g\|_g,$$

where $\|w_g\|_g$ is the group $\ell_2$ norm, $G_g$ is group $g$, and $w_g^j$ is the $j$-th component of group $G_g$.

As in the non-overlapping groups case, the group Lasso regularizer will potentially set entire groups of coefficients to zero. Selected variables are those with coefficients $w_j > 0$. However, since groups may now overlap, the set of selected variables is the intersection of the complements of the groups that are set to zero; a variable survives only if no group containing it has been zeroed out.

This intersection of complements selection criterion implies the modeling choice that we allow some coefficients within a particular group $g$ to be set to zero while other coefficients within the same group $g$ remain positive; in other words, the support of the solution need not be a union of complete groups.
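To make the selection rule concrete, the following Python sketch (with hypothetical groups and zero pattern) computes the selected variables as the intersection of the complements of the zeroed groups:

```python
# Illustrative only: hypothetical overlapping groups and zero pattern.
d = 6
groups = [{0, 1, 2}, {2, 3, 4}, {4, 5}]   # overlapping groups of variable indices
zeroed = [True, False, True]              # groups driven to zero by the regularizer

# A variable survives only if it lies in the complement of every zeroed group.
support = set(range(d))
for g, is_zero in zip(groups, zeroed):
    if is_zero:
        support &= set(range(d)) - g

print(sorted(support))  # [3]: all other variables belong to some zeroed group
```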
A different approach is to consider the union of groups for variable selection. This approach captures the modeling situation where variables can be selected as long as they belong to at least one group with positive coefficients. This modeling perspective implies that we want to preserve group structure.
The formulation of the union of groups approach is also referred to as latent group Lasso, and requires modifying the group $\ell_2$ norm considered above and introducing the following regularizer:

$$R(w) = \inf \left\{ \sum_{g=1}^G \|\bar{w}_g\|_g \;:\; w = \sum_{g=1}^G \bar{w}_g \right\},$$

where $w \in \mathbb{R}^d$ and each $\bar{w}_g \in \mathbb{R}^d$ is a latent vector of coefficients associated with group $g$, constrained so that $\bar{w}_g^j$ may be non-zero only if variable $j$ belongs to group $g$, and $\bar{w}_g^j = 0$ otherwise; the infimum is taken over all such decompositions of $w$.
This regularizer can be interpreted as effectively replicating variables that belong to more than one group, therefore conserving group structure. As intended by the union of groups approach, the constraint $w = \sum_{g=1}^G \bar{w}_g$ builds the coefficient vector as a sum of group-supported latent vectors, so the support of the solution is a union of groups: a variable is selected as soon as at least one group containing it is selected.
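The replication interpretation suggests a simple way (illustrative, not from the original article) to handle the latent group Lasso with a non-overlapping group Lasso solver: duplicate each column of the design matrix once per group it belongs to, solve the resulting non-overlapping problem, and sum the latent blocks back. The sketch below only builds the replicated design matrix and recovers $w$ from latent coefficients; the solver itself is assumed.

```python
import numpy as np

def replicate_columns(X, groups):
    """Duplicate each column of X once for every (hypothetical) group it belongs to.

    Returns the expanded design matrix and, for each block, the original column
    indices it covers, so the latent group Lasso over overlapping groups becomes
    a non-overlapping group Lasso over the expanded columns.
    """
    blocks = [np.asarray(sorted(g)) for g in groups]
    X_rep = np.hstack([X[:, idx] for idx in blocks])
    return X_rep, blocks

def recover_w(w_latent, blocks, d):
    """Sum the latent, group-supported vectors back into a single w in R^d."""
    w = np.zeros(d)
    offset = 0
    for idx in blocks:
        w[idx] += w_latent[offset:offset + len(idx)]
        offset += len(idx)
    return w

# Toy usage with overlapping groups over 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
groups = [{0, 1, 2}, {2, 3}, {3, 4}]
X_rep, blocks = replicate_columns(X, groups)   # X_rep has 3 + 2 + 2 = 7 columns
w_latent = rng.normal(size=X_rep.shape[1])     # stand-in for a fitted latent solution
print(recover_w(w_latent, blocks, d=5))
```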
The objective function using the group lasso consists of an error function, which is generally required to be convex but not necessarily strongly convex, and a group $\ell_1$ regularization term. An issue with this objective function is that, being convex but not strongly convex, it generally does not have a unique minimizer.

An example of a way to fix this is to introduce the squared $\ell_2$ norm of the weight vector as an additional regularization term, while keeping the group $\ell_1$ regularization term. If the coefficient of the squared $\ell_2$ norm term is greater than $0$, then, because the squared $\ell_2$ norm is strongly convex, the resulting objective function is also strongly convex and therefore has a unique minimizer. Provided the $\ell_2$ coefficient is chosen suitably small, that minimizer stays close to a minimizer of the original objective without the squared $\ell_2$ term.
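A minimal sketch of the modified objective, assuming a least-squares error term and placeholder data, groups, and coefficients:

```python
import numpy as np

def group_lasso_ridge_objective(w, X, y, groups, lam, mu):
    """Least-squares group Lasso objective plus a squared-l2 (ridge) term.

    With mu > 0 the ridge term makes the objective strongly convex, so the
    minimizer is unique; for small mu it stays close to the plain group
    Lasso solution.
    """
    residual = X @ w - y
    data_fit = 0.5 / len(y) * residual @ residual
    group_penalty = lam * sum(np.linalg.norm(w[np.asarray(g)]) for g in groups)
    ridge = mu * (w @ w)
    return data_fit + group_penalty + ridge
```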
Besides the norms discussed above, other norms used in structured sparsity methods include hierarchical norms and norms defined on grids. These norms arise from submodular functions and allow the incorporation of prior assumptions on the structure of the input variables. In the context of hierarchical norms, this structure can be represented as a directed acyclic graph over the variables, while in the context of grid-based norms the structure can be represented using a grid.
Unsupervised learning methods are often used to learn the parameters of latent variable models. Latent variable models are statistical models where in addition to the observed variables, a set of latent variables also exists which is not observed. Often in such models, "hierarchies" are assumed between the variables of the system; this system of hierarchies can be represented using directed acyclic graphs.
Hierarchies of latent variables have emerged as a natural structure in several applications, notably to model text documents.[8] Hierarchical models using Bayesian non-parametric methods have been used to learn topic models,[9] which are statistical models for discovering the abstract "topics" that occur in a collection of documents. Hierarchies have also been considered in the context of kernel methods.[10] Hierarchical norms have been applied to bioinformatics,[11] computer vision and topic models.[12]
If the structure assumed over the variables is in the form of a 1D, 2D or 3D grid, then submodular functions based on overlapping groups can be considered as norms, leading to stable sets equal to rectangular or convex shapes. Such methods have applications in computer vision.[13]
The problem of choosing the best subset of input variables can be naturally formulated under a penalization framework as:[14]
$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(y_i, w, x_i) + \lambda \|w\|_0,$$

where $\|w\|_0$ denotes the $\ell_0$ "norm", defined as the number of nonzero entries of the vector $w$.
Although this formulation makes sense from a modeling perspective, it is computationally infeasible, as it is equivalent to an exhaustive search evaluating all possible subsets of variables.
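For intuition, a toy sketch (not a practical method) of exhaustive best-subset selection for a least-squares loss: it enumerates all $2^d$ supports, which is exactly what makes the $\ell_0$-penalized formulation intractable beyond very small $d$.

```python
import numpy as np
from itertools import combinations

def best_subset_least_squares(X, y, lam):
    """Exhaustive search over all supports for the l0-penalized least-squares
    problem; the cost grows as 2^d, which is why this is only a toy illustration."""
    n, d = X.shape
    best_obj, best_w = np.inf, np.zeros(d)
    for k in range(d + 1):
        for support in combinations(range(d), k):
            w = np.zeros(d)
            if support:
                idx = list(support)
                w[idx], *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            obj = 0.5 / n * np.sum((X @ w - y) ** 2) + lam * k
            if obj < best_obj:
                best_obj, best_w = obj, w
    return best_w
```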
Two main approaches for solving the optimization problem are: 1) greedy methods, such as step-wise regression in statistics, or matching pursuit in signal processing; and 2) convex relaxation formulation approaches and proximal gradient optimization methods.
A natural approximation of the best subset selection problem is $\ell_1$ norm regularization:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(y_i, w, x_i) + \lambda \|w\|_1.$$

Such a scheme is called basis pursuit or the Lasso, which substitutes the $\ell_0$ "norm" with the convex, non-differentiable $\ell_1$ norm.
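The convexity of the $\ell_1$ norm is what makes this relaxation tractable; its proximal operator is elementwise soft-thresholding, sketched below for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||w||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

print(soft_threshold(np.array([3.0, -0.2, 0.7]), 0.5))  # approx. [2.5, 0.0, 0.2]
```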
See main article: Proximal gradient methods and Proximal gradient methods for learning.
Proximal gradient methods, also called forward-backward splitting, are optimization methods useful for minimizing functions with a convex and differentiable component, and a convex potentially non-differentiable component.
As such, proximal gradient methods are useful for solving sparsity and structured sparsity regularization problems[15] of the following form:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(y_i, w, x_i) + R(w),$$

where $V(y_i, w, x_i)$ is a convex and differentiable loss function, and $R(w)$ is a convex, potentially non-differentiable regularizer such as the $\ell_1$ norm or one of the group norms above.
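A minimal forward-backward iteration for such problems might look like the following sketch, assuming a least-squares loss and reusing the proximal operators defined in the earlier sketches (e.g. `soft_threshold` or `prox_group_lasso`); the step size and iteration count are placeholders.

```python
import numpy as np

def proximal_gradient(X, y, prox, lam, step=None, n_iter=200):
    """Forward-backward splitting for 0.5/n * ||Xw - y||^2 + lam * R(w).

    `prox(v, t)` must return the proximal operator of t * R evaluated at v,
    e.g. soft_threshold for the l1 norm or prox_group_lasso for group norms.
    """
    n, d = X.shape
    if step is None:
        # inverse of the Lipschitz constant of the gradient of the quadratic loss
        step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n           # forward (gradient) step
        w = prox(w - step * grad, step * lam)  # backward (proximal) step
    return w
```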
See main article: Multiple kernel learning.
Structured Sparsity regularization can be applied in the context of multiple kernel learning.[16] Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm.
In the algorithms mentioned above, a whole space was taken into consideration at once and was partitioned into groups, i.e. subspaces. A complementary point of view is to consider the case in which distinct spaces are combined to obtain a new one. It is useful to discuss this idea considering finite dictionaries. Finite dictionaries with linearly independent elements - these elements are also known as atoms - refer to finite sets of linearly independent basis functions, the linear combinations of which define hypothesis spaces. Finite dictionaries can be used to define specific kernels, as will be shown. Assume for this example that rather than only one dictionary, several finite dictionaries are considered.
For simplicity, the case in which there are only two dictionaries $A = \{a_j : X \to \mathbb{R},\ j = 1, \dots, p\}$ and $B = \{b_t : X \to \mathbb{R},\ t = 1, \dots, q\}$, where $p$ and $q$ are integers, will be considered. The atoms of $A$, as well as the atoms of $B$, are assumed to be linearly independent. Let $D = \{d_k : X \to \mathbb{R},\ k = 1, \dots, p+q\} = A \cup B$ be the union of the two dictionaries, and let $H$ be the linear space of functions of the form

$$f(x) = \sum_{k=1}^{p+q} w_k d_k(x) = \sum_{j=1}^p w_A^j a_j(x) + \sum_{t=1}^q w_B^t b_t(x), \quad x \in X,$$

for some coefficient vectors $w_A \in \mathbb{R}^p$, $w_B \in \mathbb{R}^q$, where $w = (w_A, w_B)$. Assume the atoms in $D$ to still be linearly independent, or, equivalently, that the map $w = (w_A, w_B) \mapsto f$ is one to one. The functions in the space $H$ can be seen as sums of two functions, one in the space $H_A$ spanned by the atoms of $A$, and one in $H_B$ spanned by the atoms of $B$.
One choice of norm on this space is $\|f\| = \|w_A\| + \|w_B\|$, where the norms on the right-hand side are the Euclidean norms of the coefficient blocks. Note that $H$ is the sum of the subspaces $H_A$ and $H_B$. Since $H$ can be identified with $\mathbb{R}^{p+q}$, and $H_A, H_B$ can be identified with $\mathbb{R}^p, \mathbb{R}^q$ respectively, $H$ can be seen as the direct sum of $H_A$ and $H_B$.
Here, $H_A$ and $H_B$ are reproducing kernel Hilbert spaces with corresponding feature maps $\Phi_A : X \to \mathbb{R}^p$ given by $\Phi_A(x) = (a_1(x), \dots, a_p(x))$, and $\Phi_B : X \to \mathbb{R}^q$ given by $\Phi_B(x) = (b_1(x), \dots, b_q(x))$, while $H$ is a reproducing kernel Hilbert space with feature map $\Phi : X \to \mathbb{R}^{p+q}$ obtained by concatenating $\Phi_A$ and $\Phi_B$.
In the structured sparsity regularization approach to this scenario, the relevant groups of variables which the group norms consider correspond to the subspaces $H_A$ and $H_B$, so that the group penalty either keeps or discards each dictionary, and hence the kernel it induces, as a whole.
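As an illustrative sketch (the dictionaries, data, and regularization parameter below are hypothetical), one can build the concatenated feature map and apply the group penalty with one group per dictionary, reusing `prox_group_lasso` and `proximal_gradient` from the earlier sketches, so that whole dictionaries, and thus whole kernels, are switched on or off.

```python
import numpy as np

# Hypothetical dictionaries: A holds p = 3 polynomial atoms, B holds q = 2 sine atoms.
atoms_A = [lambda x: x, lambda x: x**2, lambda x: x**3]
atoms_B = [lambda x: np.sin(x), lambda x: np.sin(2 * x)]

def feature_map(x):
    """Concatenation of Phi_A and Phi_B evaluated at the sample points x."""
    return np.column_stack([a(x) for a in atoms_A] + [b(x) for b in atoms_B])

x = np.linspace(-1, 1, 50)
y = x**2 + 0.01 * np.random.default_rng(0).normal(size=x.size)  # depends on A only

Phi = feature_map(x)
groups = [np.arange(0, 3), np.arange(3, 5)]   # one group per dictionary
w = proximal_gradient(Phi, y, lambda v, t: prox_group_lasso(v, groups, t), lam=0.1)
print(np.round(w, 3))  # the B block is expected to shrink toward zero
```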
The above reasoning directly generalizes to any finite number of dictionaries, or feature maps. It can be extended to feature maps inducing infinite dimensional hypothesis spaces.
Considering sparse multiple kernel learning is useful in several situations, for instance when each kernel $K_g$ acts only on a subset of the input variables (for example, on a single dimension), in which case selecting a sparse combination of kernels performs nonlinear variable selection. Generally, sparse multiple kernel learning is particularly useful when there are many kernels and model selection and interpretability are important.
Structured sparsity regularization methods have been used in a number of settings where it is desired to impose an a priori input variable structure on the regularization process. Some such applications are: