Rough Map
1. Initial Prediction
$\log(\text{odds})$ calculated from the training data ➡️ probability
Come up with an optimal initial prediction $\gamma = \log(\text{odds})$ that minimizes the loss, giving $F_0(x)$:
$$F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$$
To make the calculation easier, substitute $\gamma$ with the probability $p$ and solve for $p$ first.
Probability is used to get the residual
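As a minimal sketch (assuming a binary 0/1 label array `y`; the names are illustrative, not part of the algorithm description), the initial $\log(\text{odds})$, the corresponding probability, and the first residuals could be computed like this:

```python
import numpy as np

# Hypothetical binary labels (1 = positive class, 0 = negative class)
y = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# Initial prediction: log(odds) of the positive class in the training data
log_odds = np.log(y.sum() / (len(y) - y.sum()))

# Convert the log(odds) into a probability with the logistic function
p = np.exp(log_odds) / (1 + np.exp(log_odds))

# The probability is used to compute the first residuals
residuals = y - p
```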
2. Train a Tree
In practice people often set the maximum number of leaves to be between 8 and 32.
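For example, with scikit-learn's GradientBoostingClassifier this cap corresponds to the `max_leaf_nodes` parameter; the values below are just a common starting point, not a recommendation:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Cap each tree at 8 leaves; 100 trees with a small learning rate is a typical default
model = GradientBoostingClassifier(max_leaf_nodes=8, n_estimators=100, learning_rate=0.1)
# model.fit(X_train, y_train)   # X_train, y_train stand for whatever training data you have
```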
Loss Function
The log(likelihood) of the observed data given the prediction (also used in logistic regression):
$$\sum_{i=1}^{N} y_{i} \times \log (p_i)+\left(1-y_{i}\right) \times \log (1-p_i)$$
The better the prediction, the larger the log(likelihood). (try to maximize it)
So to use the log(likelihood) as a loss function that we want to minimize, just multiply it by -1.
This is a function of $p$; we convert it into a function of $\log(\text{odds})$:
$$-y_i\times \log (\text{odds})+\log \left(1+e^{\log (\text{odds})}\right)$$
Take the derivative of the loss function with respect to the predicted $\log(\text{odds})$:
$$-y_i+\frac{e^{\log (\text{odds})}}{1+e^{\log (\text{odds})}} = -y_i+p$$
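A quick numerical sketch (NumPy, made-up values) of this loss as a function of $\log(\text{odds})$, with a finite-difference check that its derivative really is $-y_i + p$:

```python
import numpy as np

def loss(y, log_odds):
    # -y * log(odds) + log(1 + e^{log(odds)}), the negative log-likelihood of one sample
    return -y * log_odds + np.log(1 + np.exp(log_odds))

def gradient(y, log_odds):
    # d(loss)/d(log(odds)) = -y + p, where p is the predicted probability
    p = np.exp(log_odds) / (1 + np.exp(log_odds))
    return -y + p

y, log_odds, eps = 1.0, 0.3, 1e-6
numeric = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
print(np.isclose(numeric, gradient(y, log_odds)))  # True
```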
So this loss function is differentiable.

Details for building a tree
For $m$ in $1...M$,
Calculate residuals for each sample
$$r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}=\left(y_i-\frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}}\right)$$
This is -1 times the derivative of the loss function with respect to $\log(\text{odds})$.
$i$ is the sample number and $m$ is the tree that we are building.
$F_{m-1}(x)$ is the most recent predicted $\log(\text{odds})$.
After simplification, $y_i-p=\text{Pseudo Residual}$.
Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$.
$J_m$ is the number of leaves.

For $j=1...J_m$, compute the output value for each leaf:
$$\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i\in R_{jm}}L\left(y_i, F_{m-1}(x_i)+\gamma\right)$$
In this formula we need to solve for $\gamma$, but taking the first-order derivative and solving for $\gamma$ directly is too hard, so we approximate the loss function with a second-order Taylor polynomial.
If there is only one sample in the leaf:
$$L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d}{d F()}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma+\frac{1}{2} \frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma^{2}$$
Take the first-order derivative with respect to $\gamma$:
$$\frac{d}{d \gamma} L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx \frac{d}{d F()}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma$$
Set the derivative to zero and solve for $\gamma$:
$$\gamma=\frac{-\frac{d}{d F()}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}{\frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}$$
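Filling in the two derivatives of this loss with respect to $\log(\text{odds})$ (the first derivative is $p - y_1$ from the gradient computed above, and the second follows because $\frac{dp}{d\log(\text{odds})}=p(1-p)$):

$$-\frac{d}{d F()}L\left(y_{1}, F_{m-1}(x_{1})\right)=y_1-p, \qquad \frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}(x_{1})\right)=p(1-p)$$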
After simplification, $\gamma_{1,1}=\frac{\text{Residual}}{p \times(1-p)}$.
Update predictions for each sample: $$F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$$
The output value of each leaf is calculated from the following formula. Since the predictions are in terms of $\log(\text{odds})$ while the residuals are in terms of probability, this transformation has to be made:
$$\frac{\sum \text{Residual}_{i}}{\sum\left[\text{Previous Probability}_{i} \times\left(1-\text{Previous Probability}_{i}\right)\right]}$$
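A minimal sketch of this leaf-output calculation, assuming `residuals` and `prev_prob` hold the residuals and previous predicted probabilities of the samples that land in one leaf:

```python
import numpy as np

def leaf_output(residuals, prev_prob):
    # Leaf output value (in log(odds) units):
    # sum of residuals / sum of previous_probability * (1 - previous_probability)
    return residuals.sum() / (prev_prob * (1 - prev_prob)).sum()

print(leaf_output(np.array([0.5, -0.5]), np.array([0.5, 0.5])))  # 0.0
```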
3. Make Predictions
Update the $\log(\text{odds})$ with the learning rate and the output value of the leaf, convert it into a probability so the new residuals can be calculated, and then the next tree can be built.
Compare the probability result with threshold 0.5 to decide which class the sample belongs to.
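Putting the classification steps together, a rough end-to-end sketch; it uses scikit-learn's DecisionTreeRegressor to fit the residuals, and all names and hyperparameters are illustrative rather than part of the algorithm description above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_classifier(X, y, n_trees=100, learning_rate=0.1, max_leaves=8):
    # 1. Initial prediction: log(odds) of the training labels
    log_odds = np.log(y.sum() / (len(y) - y.sum()))
    F = np.full(len(y), log_odds)
    trees = []
    for _ in range(n_trees):
        p = 1 / (1 + np.exp(-F))            # convert log(odds) to probabilities
        residuals = y - p                    # pseudo residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)
        # Leaf output values: sum(residuals) / sum(p * (1 - p)) per leaf
        leaf_ids = tree.apply(X)
        gammas = {l: residuals[leaf_ids == l].sum() /
                     (p[leaf_ids == l] * (1 - p[leaf_ids == l])).sum()
                  for l in np.unique(leaf_ids)}
        F += learning_rate * np.array([gammas[l] for l in leaf_ids])
        trees.append((tree, gammas))
    return log_odds, trees

def predict_class(X, log_odds, trees, learning_rate=0.1):
    F = np.full(X.shape[0], log_odds)
    for tree, gammas in trees:
        F += learning_rate * np.array([gammas[l] for l in tree.apply(X)])
    return (1 / (1 + np.exp(-F)) > 0.5).astype(int)   # threshold at 0.5
```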
Regression

Loss Function
$$\frac{1}{2} (\text{Observed} - \text{Predicted})^{2}$$
The $\frac{1}{2}$ is there to make the calculation of the derivative easier.
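Concretely, the $\frac{1}{2}$ cancels the 2 that the chain rule brings down, so the derivative with respect to the prediction is just the negative residual:

$$\frac{d}{d\,\text{Predicted}}\;\frac{1}{2}(\text{Observed}-\text{Predicted})^{2}=-(\text{Observed}-\text{Predicted})$$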
1. Initialization
Average target value in training set
$$F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$$
Take the first-order derivative, set it to zero, and the result is the average value given the above loss function.
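As a one-line check, setting that derivative to zero over the whole training set gives the average:

$$\frac{d}{d \gamma} \sum_{i=1}^{n} \frac{1}{2}\left(y_{i}-\gamma\right)^{2}=\sum_{i=1}^{n}\left(\gamma-y_{i}\right)=0 \;\Rightarrow\; \gamma=\frac{1}{n} \sum_{i=1}^{n} y_{i}$$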
2. Build Trees
For $m=1,2...M$:
Use the learning rate to scale the contribution of each new tree and prevent overfitting: taking lots of small steps in the right direction results in better predictions on a test set (low variance).
Compute residuals $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}$ for $i=1, \ldots, n$
This is the residual $\text{Observed}-\text{Predicted}$; this gradient is what Gradient Boost is named after.

The pseudo residual $r_{im}$ is the difference between the observed values and the predicted value, just said in a different way. (When we use another loss function, for example one without the $\frac{1}{2}$, its gradient is only similar to the residual and is therefore called a pseudo residual.)

Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$.
For $j=1...J_m$, compute the output value of each leaf: $$\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_{i} \in R_{jm}} L\left(y_{i}, F_{m-1}\left(x_{i}\right)+\gamma\right)$$
The output value of each leaf given this loss function is the average of the residuals on that leaf.
Update: $$F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$$
3. Make Predictions
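To tie the regression steps together, a compact sketch under the same assumptions (scikit-learn's DecisionTreeRegressor fits the residuals; names and hyperparameters are illustrative). With squared-error loss the tree's own leaf means are already the optimal leaf outputs, so no extra leaf adjustment is needed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_regressor(X, y, n_trees=100, learning_rate=0.1, max_leaves=8):
    F0 = y.mean()                                # 1. initialization: average target value
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_trees):                     # 2. build trees
        residuals = y - F                        # pseudo residuals: observed - predicted
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)                   # leaf values = mean residual per leaf
        F += learning_rate * tree.predict(X)     # scaled update of the predictions
        trees.append(tree)
    return F0, trees

def predict_regression(X, F0, trees, learning_rate=0.1):
    # 3. make predictions: initial value plus the scaled contribution of every tree
    return F0 + learning_rate * sum(tree.predict(X) for tree in trees)
```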