Rough Map
1. Initial Prediction
$\log(\text{odds})$ calculated from the training data ➡️ probability
Come up with an optimal initial prediction $\gamma = \log(\text{odds})$ that minimizes the loss, giving $F_0(x)$:
$$F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$$
To make the calculation easier, substitute $\gamma$ with the probability $p$ and solve for $p$ first.
Probability is used to get the residual
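As a minimal sketch (assuming a binary 0/1 label array `y`; the names are illustrative, not part of the algorithm description), the initial $\log(\text{odds})$, the corresponding probability, and the first residuals could be computed like this:

```python
import numpy as np

# Hypothetical binary labels (1 = positive class, 0 = negative class)
y = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# Initial prediction: log(odds) of the positive class in the training data
log_odds = np.log(y.sum() / (len(y) - y.sum()))

# Convert the log(odds) into a probability with the logistic function
p = np.exp(log_odds) / (1 + np.exp(log_odds))

# The probability is used to compute the first residuals
residuals = y - p
```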
2. Train a Tree
In practice people often set the maximum number of leaves to be between 8 and 32.
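For example, with scikit-learn's GradientBoostingClassifier this cap corresponds to the `max_leaf_nodes` parameter; the values below are just a common starting point, not a recommendation:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Cap each tree at 8 leaves; 100 trees with a small learning rate is a typical default
model = GradientBoostingClassifier(max_leaf_nodes=8, n_estimators=100, learning_rate=0.1)
# model.fit(X_train, y_train)   # X_train, y_train stand for whatever training data you have
```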
Loss Function
The log(likelihood) of the observed data given the prediction (also used in logistic regression):
$$\sum_{i=1}^{N} y_{i} \times \log (p_i)+\left(1-y_{i}\right) \times \log (1-p_i)$$
The better the prediction, the larger the log(likelihood). (try to maximize it)
So to use the log(likelihood) as a loss function that we want to minimize, just multiply it by -1.
This is a function of $p$; we convert it into a function of $\log(\text{odds})$:
$$-y_i\times \log (\text{odds})+\log \left(1+e^{\log (\text{odds})}\right)$$
Take the derivative of the loss function with respect to the predicted $\log(\text{odds})$:
$$-y_i+\frac{e^{\log (\text{odds})}}{1+e^{\log (\text{odds})}} = -y_i+p$$
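A quick numerical sketch (NumPy, made-up values) of this loss as a function of $\log(\text{odds})$, with a finite-difference check that its derivative really is $-y_i + p$:

```python
import numpy as np

def loss(y, log_odds):
    # -y * log(odds) + log(1 + e^{log(odds)}), the negative log-likelihood of one sample
    return -y * log_odds + np.log(1 + np.exp(log_odds))

def gradient(y, log_odds):
    # d(loss)/d(log(odds)) = -y + p, where p is the predicted probability
    p = np.exp(log_odds) / (1 + np.exp(log_odds))
    return -y + p

y, log_odds, eps = 1.0, 0.3, 1e-6
numeric = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
print(np.isclose(numeric, gradient(y, log_odds)))  # True
```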
So this loss function is differentiable.

Details for building a tree
For $m$ in $1...M$,
Calculate residuals for each sample
$$r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}=\left(y_i-\frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}}\right)$$
This is -1 times the derivative of the loss function with respect to $\log(\text{odds})$.
$i$ is the sample number and $m$ is the tree that we are building.
$F_{m-1}(x)$ is the most recent predicted $\log(\text{odds})$.
After simplification, $y_i-p=\text{Pseudo Residual}$.
Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$.
$J_m$ is the number of leaves.

For $j=1...J_m$, compute the output value for each leaf:
$$\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i\in R_{jm}}L\left(y_i, F_{m-1}(x_i)+\gamma\right)$$
In this formula we need to solve for $\gamma$, but taking the first-order derivative and solving for $\gamma$ directly is too hard, so we approximate the loss function with a second-order Taylor polynomial.
If there is only one sample in the leaf:
$$L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d}{d F()}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma+\frac{1}{2} \frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma^{2}$$
Take the first-order derivative with respect to $\gamma$:
$$\frac{d}{d \gamma} L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx \frac{d}{d F()}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma$$
Set the derivative to zero and solve for $\gamma$:
$$\gamma=\frac{-\frac{d}{d F()}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}{\frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}$$
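Filling in the two derivatives of this loss with respect to $\log(\text{odds})$ (the first derivative is $p - y_1$ from the gradient computed above, and the second follows because $\frac{dp}{d\log(\text{odds})}=p(1-p)$):

$$-\frac{d}{d F()}L\left(y_{1}, F_{m-1}(x_{1})\right)=y_1-p, \qquad \frac{d^{2}}{d F()^{2}}L\left(y_{1}, F_{m-1}(x_{1})\right)=p(1-p)$$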
After simplification, $\gamma_{1,1}=\frac{\text{Residual}}{p \times(1-p)}$.
Update predictions for each sample: $$F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$$
The output value of each leaf is calculated from the following formula. Since the predictions are in terms of $\log(\text{odds})$ while the residuals are in terms of probability, this transformation has to be made:
$$\frac{\sum \text{Residual}_{i}}{\sum\left[\text{Previous Probability}_{i} \times\left(1-\text{Previous Probability}_{i}\right)\right]}$$
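A minimal sketch of this leaf-output calculation, assuming `residuals` and `prev_prob` hold the residuals and previous predicted probabilities of the samples that land in one leaf:

```python
import numpy as np

def leaf_output(residuals, prev_prob):
    # Leaf output value (in log(odds) units):
    # sum of residuals / sum of previous_probability * (1 - previous_probability)
    return residuals.sum() / (prev_prob * (1 - prev_prob)).sum()

print(leaf_output(np.array([0.5, -0.5]), np.array([0.5, 0.5])))  # 0.0
```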
3. Make Predictions
Update the $\log(\text{odds})$ with the learning rate and the output value of the leaf, convert it into a probability so the new residuals can be calculated, and then the next tree can be built.
Compare the probability result with threshold 0.5 to decide which class the sample belongs to.
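Putting the classification steps together, a rough end-to-end sketch; it uses scikit-learn's DecisionTreeRegressor to fit the residuals, and all names and hyperparameters are illustrative rather than part of the algorithm description above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_classifier(X, y, n_trees=100, learning_rate=0.1, max_leaves=8):
    # 1. Initial prediction: log(odds) of the training labels
    log_odds = np.log(y.sum() / (len(y) - y.sum()))
    F = np.full(len(y), log_odds)
    trees = []
    for _ in range(n_trees):
        p = 1 / (1 + np.exp(-F))            # convert log(odds) to probabilities
        residuals = y - p                    # pseudo residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)
        # Leaf output values: sum(residuals) / sum(p * (1 - p)) per leaf
        leaf_ids = tree.apply(X)
        gammas = {l: residuals[leaf_ids == l].sum() /
                     (p[leaf_ids == l] * (1 - p[leaf_ids == l])).sum()
                  for l in np.unique(leaf_ids)}
        F += learning_rate * np.array([gammas[l] for l in leaf_ids])
        trees.append((tree, gammas))
    return log_odds, trees

def predict_class(X, log_odds, trees, learning_rate=0.1):
    F = np.full(X.shape[0], log_odds)
    for tree, gammas in trees:
        F += learning_rate * np.array([gammas[l] for l in tree.apply(X)])
    return (1 / (1 + np.exp(-F)) > 0.5).astype(int)   # threshold at 0.5
```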
Regression

Loss Function
$$\frac{1}{2} (\text{Observed} - \text{Predicted})^{2}$$
The $\frac{1}{2}$ is there to make the calculation of the derivative easier.
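Concretely, the $\frac{1}{2}$ cancels the 2 that the chain rule brings down, so the derivative with respect to the prediction is just the negative residual:

$$\frac{d}{d\,\text{Predicted}}\;\frac{1}{2}(\text{Observed}-\text{Predicted})^{2}=-(\text{Observed}-\text{Predicted})$$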
1. Initialization
Average target value in training set
$$F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$$
Take the first-order derivative, set it to zero, and the result is the average value given the above loss function.
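As a one-line check, setting that derivative to zero over the whole training set gives the average:

$$\frac{d}{d \gamma} \sum_{i=1}^{n} \frac{1}{2}\left(y_{i}-\gamma\right)^{2}=\sum_{i=1}^{n}\left(\gamma-y_{i}\right)=0 \;\Rightarrow\; \gamma=\frac{1}{n} \sum_{i=1}^{n} y_{i}$$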
2. Build Trees
For $m=1,2...M$:
Use the learning rate to scale the contribution of each new tree and prevent overfitting: taking lots of small steps in the right direction results in better predictions on a test set (low variance).
Compute residuals $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}$ for $i=1, \ldots, n$
This is the residual $\text{Observed}-\text{Predicted}$; this gradient is what Gradient Boost is named after.

The pseudo residual $r_{im}$ is the difference between the observed values and the predicted value, just said in a different way. (When we use another loss function, for example one without the $\frac{1}{2}$, its gradient is only similar to the residual and is therefore called a pseudo residual.)

Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$.
For $j=1...J_m$, compute the output value of each leaf: $$\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_{i} \in R_{jm}} L\left(y_{i}, F_{m-1}\left(x_{i}\right)+\gamma\right)$$
The output value of each leaf given this loss function is the average of the residuals on that leaf.
Update: $$F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$$
3. Make Predictions
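To tie the regression steps together, a compact sketch under the same assumptions (scikit-learn's DecisionTreeRegressor fits the residuals; names and hyperparameters are illustrative). With squared-error loss the tree's own leaf means are already the optimal leaf outputs, so no extra leaf adjustment is needed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_regressor(X, y, n_trees=100, learning_rate=0.1, max_leaves=8):
    F0 = y.mean()                                # 1. initialization: average target value
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_trees):                     # 2. build trees
        residuals = y - F                        # pseudo residuals: observed - predicted
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)                   # leaf values = mean residual per leaf
        F += learning_rate * tree.predict(X)     # scaled update of the predictions
        trees.append(tree)
    return F0, trees

def predict_regression(X, F0, trees, learning_rate=0.1):
    # 3. make predictions: initial value plus the scaled contribution of every tree
    return F0 + learning_rate * sum(tree.predict(X) for tree in trees)
```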