An overview of Gradient Boosting Trees

Introduction

Gradient boosting trees are among the most powerful and widely used machine learning algorithms, especially on tabular data, and they appear in many top Kaggle solutions. This post explains the main tricks and methods introduced by the most relevant works in this space.

Formally, we would like to solve the classical supervised classification or regression problem: given a loss function \(L(y, \hat{y})\), we would like to find a function \(F(x)\) that minimizes the expected loss over the data distribution \(D\): \begin{equation} \underset{F}{\operatorname{argmin}} {\mathop{\mathbb{E}}_{(x, y) \sim D}{L(y, F(x))}} \end{equation}

The main idea of boosting is to build \(F\) as a sum of weak, simple models \(h_i\) (in the case of gradient boosting trees, decision trees): \begin{equation} F(x) = \sum_{i=1}^{m}{h_{i}(x)} \end{equation}

Training

Given a training set \(\{(x_1, y_1), \dots, (x_n, y_n)\}\), the boosting method greedily picks the next best weak model \(h_m\): \begin{equation} F_{m+1}(x) = F_m(x) + \underset{h_m}{\operatorname{argmin}} \sum_{i=1}^{n}{L(y_i, F_m(x_i)+h_m(x_i))} \end{equation}

This problem is generally intractable. Hence, we approximate the optimization with gradient steps, using either gradient descent or Newton's method. For instance, in the case of gradient descent, the objective becomes:

\begin{equation} \underset{h_m}{\operatorname{argmin}} \sum_{i=1}^{n}{L(y_i, F_m(x_i)+h_m(x_i))}
\end{equation} \begin{equation} \approx \underset{h_m}{\operatorname{argmin}} \sum_{i=1}^{n}{L(y_i, F_m(x_i)) + \nabla L(y_i, F_m(x_i)) h_m(x_i)}
\end{equation} \begin{equation} = \underset{h_m}{\operatorname{argmin}} \sum_{i=1}^{n}{\nabla L(y_i, F_m(x_i)) h_m(x_i)} \end{equation}
where the first step is a first-order Taylor expansion around \(F_m(x_i)\) and the second drops the \(L(y_i, F_m(x_i))\) terms, which are constant with respect to \(h_m\).

Consequently, \(h_m(x_i)\) should be close to the negative gradient \(-\nabla L(y_i, F_m(x_i))\). One possible approach is therefore to train the decision tree on the pairs \((x_1, -\nabla L(y_1, F_m(x_1)))\), \((x_2, -\nabla L(y_2, F_m(x_2)))\), …, \((x_n, -\nabla L(y_n, F_m(x_n)))\) using the mean squared error.
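As a concrete illustration, here is a minimal sketch of this training loop for the squared-error loss, where the negative gradient is simply the residual \(y - F(x)\); the function names and hyperparameters below are mine, not taken from any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for L(y, F) = (y - F)^2 / 2: each tree is fit with
    mean squared error to the current negative gradients (the residuals)."""
    base = y.mean()                        # F_0: a constant model
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        negative_gradient = y - prediction           # -dL/dF evaluated at F_m(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, negative_gradient)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gbt(base, trees, X, learning_rate=0.1):
    # F(x) = F_0 + the sum of the (shrunken) weak learners.
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```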

Novel methods

In this section, we compare the different implementations of gradient boosting trees and other novel improvements.

XGBoost: A Scalable Tree Boosting System

First, the authors use two common regularization methods:

  1. Shrinkage, which means that each newly added decision tree \(h_m\) is multiplied by a factor \(\gamma\) smaller than 1, analogous to the learning rate in gradient descent. This technique reduces the influence of each individual tree.
  2. Column subsampling, a method borrowed from Random Forests, which both reduces overfitting and speeds up training.
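Both regularizers are exposed as hyperparameters in the XGBoost scikit-learn wrapper; a hedged example (the specific values below are arbitrary):

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,      # shrinkage: each new tree is scaled by 0.1
    colsample_bytree=0.8,   # column subsampling: 80% of features per tree
)
```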

One of the most important contributions of this work is an approximate algorithm for finding the best split. While the basic exact algorithm would evaluate \(O(NK)\) candidate splits with \(N\) samples and \(K\) features, the approximate algorithm maps the continuous features into buckets and only considers the bucket boundaries as split candidates. To choose these boundaries, the work also proposes a new weighted quantile sketch with provable theoretical guarantees.
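The following is a minimal, unweighted sketch of the bucketing idea (the paper's weighted quantile sketch additionally weights samples by their second-order gradients and works in a streaming/distributed setting; the helper names are hypothetical):

```python
import numpy as np

def build_bucket_boundaries(feature_values, num_buckets=32):
    # Candidate split points are (approximate) quantiles of the feature.
    levels = np.linspace(0, 1, num_buckets + 1)[1:-1]
    return np.unique(np.quantile(feature_values, levels))

def bucketize(feature_values, boundaries):
    # Each continuous value is mapped to a bucket index, so split finding
    # only scans num_buckets candidates instead of one per distinct value.
    return np.digitize(feature_values, boundaries)
```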

Another significant contribution of this work is sparsity-aware split finding. In tabular datasets it is quite common to have very sparse features (missing values), so when splitting, the algorithm must also decide whether the missing values go to the left or the right child. Therefore, the splitting algorithm selects a (feature, bucket, default direction) triple. This method also speeds up training when there are many missing values, since only the non-missing entries need to be scanned.
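A toy sketch of the idea, assuming a simple squared-loss gain proxy over gradient sums (the real algorithm also uses second-order statistics and regularization terms):

```python
import numpy as np

def split_with_default_direction(grad, feature, threshold):
    """For one candidate split, try sending the missing values (NaN) to the
    left and to the right child, and keep the default direction with the
    larger gain."""
    missing = np.isnan(feature)
    left = ~missing & (feature < threshold)
    right = ~missing & ~(feature < threshold)

    def score(mask):
        # G^2 / n proxy for the quality of a child (squared loss, no regularization).
        return grad[mask].sum() ** 2 / max(mask.sum(), 1)

    gain_missing_left = score(left | missing) + score(right)
    gain_missing_right = score(left) + score(right | missing)
    default = "left" if gain_missing_left >= gain_missing_right else "right"
    return default, max(gain_missing_left, gain_missing_right)
```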

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

The main aim of LightGBM is to work efficiently when the feature dimension is high and the data size is large. Similar to XGBoost, this method uses a histogram-based algorithm for splitting.

This work proposes two methods, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to improve training performance without hurting the accuracy of the model.

The first one, Gradient-based One-Side Sampling, relies on the observation that samples with large gradients are more informative, while those with small gradients contribute less to learning. However, simply discarding the small-gradient samples is harmful, as this changes the data distribution. Thus, the method keeps all the large-gradient samples and randomly samples from the small-gradient ones: it first sorts the samples by the absolute value of their gradients, keeps the top \(a \times 100\%\), and then picks an additional \(b \times 100\%\) at random from the rest of the dataset. Finally, it amplifies the weights of the sampled small-gradient instances by a factor of \(\frac{1-a}{b}\). Consequently, the more difficult samples receive more attention while the data distribution is not changed too much.
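A short sketch of GOSS under these assumptions (the function name and default values are mine):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Keep the top a*100% of samples by |gradient|, randomly sample b*100%
    of the remaining ones, and up-weight those sampled small-gradient
    instances by (1 - a) / b to roughly preserve the data distribution."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))      # descending by |gradient|
    top_k, rand_k = int(a * n), int(b * n)

    large = order[:top_k]                                        # always kept
    small = rng.choice(order[top_k:], size=rand_k, replace=False)

    indices = np.concatenate([large, small])
    weights = np.ones(len(indices))
    weights[top_k:] = (1 - a) / b               # amplify the sampled small gradients
    return indices, weights
```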

The second method proposed is called Exclusive Feature Bundling, which aims to reduce the number of used features. The main idea of this approach is to group exclusive features, i.e. features that never take nonzero values simultaneously, into exclusive feature bundles. However, partitioning features into the smallest number of exclusive bundles is NP-hard, since the graph coloring problem, which is known to be NP-hard, can be reduced to it. Thus, an approximate algorithm is used; additionally, it allows a small conflict factor \(\gamma\) for features that are not perfectly exclusive. To merge a bundle of exclusive features, each feature is mapped into a different value range of the bundled feature, e.g. if features \(A\) and \(B\) are exclusive, then feature \(A\) is mapped to \([0, 10)\) and feature \(B\) to \([10, 20)\). In this way, the number of features is reduced in an almost lossless way.
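A toy illustration of merging two mutually exclusive features into one bundle by offsetting their value ranges (assuming both features were already scaled into \([0, 10)\); the function name is hypothetical):

```python
import numpy as np

def bundle_two_exclusive_features(feature_a, feature_b, offset=10.0):
    """Feature A keeps its values in [0, offset); nonzero values of feature B
    are shifted into [offset, 2 * offset), so one histogram can distinguish
    which original feature a bundled value came from."""
    bundled = feature_a.astype(float).copy()
    nonzero_b = feature_b != 0
    bundled[nonzero_b] = feature_b[nonzero_b] + offset
    return bundled
```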

CatBoost: gradient boosting with categorical features support

CatBoost focuses mostly on how to handle categorical features and on solving the gradient bias problem.

Let’s first consider how it is possible to handle categorical features:

  1. One-hot encoding: converts each categorical feature into a set of 0-1 features, one for each possible value. However, this method increases the number of features and cannot handle categories with high cardinality.
  2. Greedy TS (target statistic): estimates \(\hat{x}_k^i = {\mathop{\mathbb{E}}}[y \mid x^i=x_k^i]\) for categorical feature \(i\) and the \(k\)-th training sample. The prior \(p\), which is generally the average target across the entire dataset, is mostly useful for low-frequency categories. The strength of the prior is controlled by a smoothing factor \(a\): the larger the smoothing factor, the more important the prior becomes. \begin{equation} \hat{x}_k^i = \frac{a p+\sum_{j=1}^{n}{[x^i_j=x^i_k]y_j}}{a+\sum_{j=1}^{n}{[x^i_j=x^i_k]}} \end{equation} It is worth noting that this approach suffers from target leakage, which may hurt the generalization of the model.
  3. Holdout TS: first splits the dataset into two partitions; the first one is used to compute the greedy TS, while the second is used for training. This avoids leakage but reduces the amount of data used for training.
  4. Leave-one-out TS: computes the TS on all samples except the current one. However, this approach still does not prevent target leakage.
  5. Ordered TS: the method proposed in CatBoost; the samples are processed according to a random permutation, and the greedy TS for each sample is computed using only the samples already processed before it (see the sketch after this list).
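A minimal sketch of the ordered TS under these assumptions (a single random permutation, smoothing factor \(a\), and prior \(p\) as defined above; the function name is mine):

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior, a=1.0, seed=0):
    """Process the samples in a random permutation and encode each one using
    only the samples that came before it, so a sample's own label never
    leaks into its encoding."""
    rng = np.random.default_rng(seed)
    n = len(categories)
    encoded = np.empty(n)
    sums, counts = {}, {}
    for i in rng.permutation(n):
        c = categories[i]
        # Smoothed TS computed from the "past" only: (a * p + sum_y) / (a + count).
        encoded[i] = (a * prior + sums.get(c, 0.0)) / (a + counts.get(c, 0))
        # Only now reveal this sample's label to future samples.
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded
```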

Another issue that CatBoost tackles is the gradient bias: the distribution of \(g^t(x_k, y_k)\mid x_k\) is shifted from the distribution of \(g^t(x, y)\mid x\) because \((x_k, y_k)\) was already used to train \(F^t\). To solve this problem, CatBoost changes how the values are assigned to each leaf while keeping the same splitting process: the model first selects the best splits, as in the previous methods, but then assigns the leaf values differently.

The original algorithm would assign to each leaf the average gradient of the samples falling into it: \begin{equation} \text{leafValue} = \frac{\sum_{i \in \text{leaf}}{g_i}}{|\text{leaf}|} \end{equation} where \(g_i\) is the gradient computed for the \(i\)-th sample.

In the new method, every time a new tree is added, \(n\) auxiliary models are trained: the \(i\)-th model is trained on the first \(i\) samples, and the gradient \(g_i\) of the \(i\)-th sample is computed with the model trained on the first \(i-1\) samples, i.e., a model that has never seen that sample, which yields an unbiased estimate. However, this method is \(O(n^2)\) and has a high variance. The variance can be mitigated by repeating this step \(s\) times with different random orderings and averaging, while the quadratic cost can be reduced by keeping only the models trained on prefixes whose length is a power of 2. The total amount of training data processed is then \(2^1+2^2+\dots+2^{\lceil \log_2{n} \rceil} < 4n\), which is linear in \(n\). Hence, the final complexity becomes \(O(ns)\).

Finally, CatBoost introduces another trick to make the base predictors faster: oblivious trees. In an oblivious tree, all nodes at the same level share the same splitting criterion, so the trees are symmetric and less prone to overfitting. Additionally, the leaf values can be stored in a flat array of size \(2^d\), where \(d\) is the depth of the tree, and a leaf is addressed by the \(d\) binary split decisions.
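A toy sketch of prediction with an oblivious tree of depth \(d\): because every level shares one (feature, threshold) pair, the leaf index is just the \(d\) comparison bits packed into an integer (the names below are hypothetical):

```python
def oblivious_tree_predict(x, split_features, split_thresholds, leaf_values):
    """split_features and split_thresholds have one entry per level (depth d),
    and leaf_values is a flat array of size 2**d."""
    index = 0
    for feature, threshold in zip(split_features, split_thresholds):
        index = (index << 1) | int(x[feature] > threshold)
    return leaf_values[index]
```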

CatBoost also implements various bootstrapping methods:

  1. Subsampling: means that at every iteration a random subsample is picked and used for training.
  2. Bayesian bootstrapping: means that every sample is weighted by \(w_i\), with the weights drawn according to the Bayesian bootstrap procedure (see the example below).
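Both bootstrap options are exposed through CatBoost's Python API; a hedged example (the specific values are arbitrary):

```python
from catboost import CatBoostClassifier

# Plain subsampling: each iteration trains on a random 80% of the rows.
subsampled = CatBoostClassifier(bootstrap_type="Bernoulli", subsample=0.8, verbose=False)

# Bayesian bootstrap: each iteration draws per-sample weights w_i;
# bagging_temperature controls how spread out the weights are.
bayesian = CatBoostClassifier(bootstrap_type="Bayesian", bagging_temperature=1.0, verbose=False)
```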