Gradient boosting is a machine learning technique based on boosting in a functional space, where the target is pseudo-residuals instead of residuals as in traditional boosting. It gives a prediction model in the form of an ensemble of weak prediction models, i.e., models that make very few assumptions about the data, which are typically simple decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. As with other boosting methods, a gradient-boosted trees model is built in stages, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.
The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function.[1] Explicit regression gradient boosting algorithms were subsequently developed, by Jerome H. Friedman,[2] [3] (in 1999 and later in 2001) simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.[4] [5] The latter two papers introduced the view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
(This section follows the exposition by Cheng Li.[6])
Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner iteratively. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model
F
\hat{y}=F(x)
\tfrac{1}{n}\sumi(\hat{y}i-
2 | |
y | |
i) |
i
n
y
\hatyi=
F(xi)
yi=
n=
y
If the algorithm has
M
m
1\lem\leM
Fm
m
\hatyi
\bary
y
Fm
hm(x)
Fm+1(xi)=Fm(xi)+hm(xi)=yi
or, equivalently,
hm(xi)=yi-Fm(xi)
Therefore, gradient boosting will fit
hm
yi-Fm(xi)
Fm+1
Fm
hm(xi)
F(xi)
L\rm=
1 | |
n |
n | |
\sum | |
i=1 |
\left(yi-
2 | |
F(x | |
i)\right) |
-
\partialL\rm | |
\partialF(xi) |
=
2 | |
n |
(yi-F(xi))=
2 | |
n |
hm(xi)
So, gradient boosting could be generalized to a gradient descent algorithm by "plugging in" a different loss and its gradient.
Many supervised learning problems involve an output variable and a vector of input variables, related to each other with some probabilistic distribution. The goal is to find some function
\hat{F}(x)
L(y,F(x))
\hat{F}=\underset{F}{\argmin}Ex,y[L(y,F(x))]
The gradient boosting method assumes a real-valued . It seeks an approximation
\hat{F}(x)
hm(x)
l{H}
\hat{F}(x)=
M | |
\sum | |
m=1 |
\gammamhm(x)+const
We are usually given a training set
\{(x1,y1),...,(xn,yn)\}
\hat{F}(x)
F0(x)
F0(x)=\underset{hm\in
n | |
l{H}}{\argmin}{\sum | |
i=1 |
{L(yi,hm(xi))}}
Fm(x)=Fm-1(x)+\left(\underset{hm\in
n | |
l{H}}{\operatorname{argmin}}\left[{\sum | |
i=1 |
{L(yi,Fm-1(xi)+hm(xi))}}\right]\right)(x)
for
m\geq1
hm\inl{H}
Unfortunately, choosing the best function
hm
Fm-1(x)
\gamma
Fm(x)=Fm-1(x)-\gamma
n | |
\sum | |
i=1 |
{\nabla | |
Fm-1 |
L(yi,Fm-1(xi))}
where
\gamma>0
\gamma
L(yi,Fm(xi))\leL(yi,Fm-1(xi))
To prove the following, consider the objective
O=
n | |
\sum | |
i=1 |
{L(yi,Fm-1(xi)+hm(xi))}
Doing a Taylor expansion around the fixed point
Fm-1(xi)
O=
n | |
\sum | |
i=1 |
{L(yi,Fm-1(xi)+hm(xi))} ≈
n | |
\sum | |
i=1 |
{L(yi,Fm-1(xi))+hm(xi)\nabla
Fm-1 |
L(yi,Fm-1(xi))}+\ldots
Now differentiating w.r.t to
hm(xi)
\nabla | |
Fm-1 |
L(yi,Fm-1(xi))
Furthermore, we can optimize
\gamma
\gamma
\gammam=\underset{\gamma}{\argmin}
n | |
{\sum | |
i=1 |
{L(yi,Fm(xi))}}=\underset{\gamma}{\argmin}
n | |
{\sum | |
i=1 |
{L\left(yi,Fm-1(xi)- \gamma
\nabla | |
Fm-1 |
L(yi,Fm-1(xi))\right)}}.
If we considered the continuous case, i.e., where
l{H}
\R
Fm(x)=Fm-1(x)-\gammam
n | |
\sum | |
i=1 |
{\nabla | |
Fm-1 |
L(yi,Fm-1(xi))}
\gammam
l{H}
\{(xi,yi)\}
n, | |
i=1 |
L(y,F(x)),
Algorithm:
F0(x)=\underset{\gamma}{\argmin}
n | |
\sum | |
i=1 |
L(yi,\gamma).
rim=-\left[
\partialL(yi,F(xi)) | |
\partialF(xi) |
\right] | |
F(x)=Fm-1(x) |
fori=1,\ldots,n.
hm(x)
\{(xi,rim
n | |
)\} | |
i=1 |
\gammam
\gammam=\underset{\gamma}{\operatorname{argmin}}
n | |
\sum | |
i=1 |
L\left(yi,Fm-1(xi)+\gammahm(xi)\right).
Fm(x)=Fm-1(x)+\gammamhm(x).
FM(x).
Gradient boosting is typically used with decision trees (especially CARTs) of a fixed size as base learners. For this special case, Friedman proposes a modification to gradient boosting method which improves the quality of fit of each base learner.
Generic gradient boosting at the m-th step would fit a decision tree
hm(x)
Jm
Jm
R1m,\ldots,
R | |
Jmm |
hm(x)
hm(x)=
Jm | |
\sum | |
j=1 |
bjm
1 | |
Rjm |
(x),
bjm
Rjm
Then the coefficients
bjm
\gammam
Fm(x)=Fm-1(x)+\gammamhm(x), \gammam=\underset{\gamma}{\operatorname{argmin}}
n | |
\sum | |
i=1 |
L(yi,Fm-1(xi)+\gammahm(xi)).
Friedman proposes to modify this algorithm so that it chooses a separate optimal value
\gammajm
\gammam
bjm
Fm(x)=Fm-1(x)+
Jm | |
\sum | |
j=1 |
\gammajm
1 | |
Rjm |
(x), \gammajm=\underset{\gamma}{\operatorname{argmin}}
\sum | |
xi\inRjm |
L(yi,Fm-1(xi)+\gamma).
The number
J
J=2
J=3
J
Hastie et al.[9] comment that typically
4\leqJ\leq8
J
J=2
J>10
Fitting the training set too closely can lead to degradation of the model's generalization ability, that is, its performance on unseen examples. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.
One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of base models). Increasing M reduces the error on training set, but increases risk of overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set.
Another regularization parameter for tree boosting is tree depth. The higher this value the more likely the model will overfit the training data.
An important part of gradient boosting is regularization by shrinkage which uses a modified update rule:
Fm(x)=Fm-1(x)+\nu ⋅ \gammamhm(x), 0<\nu\leq1,
\nu
Empirically, it has been found that using small learning rates (such as
\nu<0.1
\nu=1
Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bootstrap aggregation ("bagging") method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement.[10] Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
Subsample size is some constant fraction
f
f=1
f
0.5\leqf\leq0.8
f
Also, like in bagging, subsampling allows one to define an out-of-bag error of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.[11] [12]
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes. It is used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances.
Imposing this limit helps to reduce variance in predictions at leaves.
Another useful regularization technique for gradient boosted model is to penalize its complexity.[13] For gradient boosted trees, model complexity can be defined as the proportional number of leaves in the trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm to remove branches that fail to reduce the loss by a threshold.
Other kinds of regularization such as an
\ell2
Gradient boosting can be used in the field of learning to rank. The commercial web search engines Yahoo[14] and Yandex[15] use variants of gradient boosting in their machine-learned ranking engines. Gradient boosting is also utilized in High Energy Physics in data analysis. At the Large Hadron Collider (LHC), variants of gradient boosting Deep Neural Networks (DNN) were successful in reproducing the results of non-machine learning methods of analysis on datasets used to discover the Higgs boson.[16] Gradient boosting decision tree was also applied in earth and geological studies – for example quality evaluation of sandstone reservoir.[17]
The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM). Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".[4] Friedman et al. describe an advancement of gradient boosted models as Multiple Additive Regression Trees (MART);[18] Elith et al. describe that approach as "Boosted Regression Trees" (BRT).[19]
A popular open-source implementation for R calls it a "Generalized Boosting Model", however packages expanding this work use BRT.[20] Yet another name is TreeNet, after an early commercial implementation from Salford System's Dan Steinberg, one of researchers who pioneered the use of tree-based methods.[21]
Gradient boosting can be used for feature importance ranking, which is usually based on aggregating importance function of the base learners.[22] For example, if a gradient boosted trees algorithm is developed using entropy-based decision trees, the ensemble algorithm ranks the importance of features based on entropy as well with the caveat that it is averaged out over all base learners.
While boosting can increase the accuracy of a base learner, such as a decision tree or linear regression, it sacrifices intelligibility and interpretability.[23] For example, following the path that a decision tree takes to make its decision is trivial and self-explained, but following the paths of hundreds or thousands of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming an XGBoost into a single "born-again" decision tree that approximates the same decision function.[24] Furthermore, its implementation may be more difficult due to the higher computational demand.
bjm
Rjm
Rjm