[Machine Learning Algorithms] Gradient Boosting Methods


Posted by Kendyu on 2021-01-17 16:59:50

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.

Comparison with AdaBoost

| | AdaBoost | Gradient Boost |
| --- | --- | --- |
| Initial | builds a very short tree called a stump | makes a single leaf, representing the initial guess for the target value of all samples |
| Loop | builds another stump based on the errors made by the previous stump | builds a tree based on the errors made by the previous tree (usually larger than a stump) |
| Scale | the amount of say a stump has on the final output is based on how well it compensated for previous errors | scales all trees by the same amount with the learning rate |

Classification

Rough Map
1. Initial Prediction

The $\log(\text{odds})$ is calculated from the training data and then converted into a probability.

Come up with an optimal initial prediction $\gamma = \log(\text{odds})$ that minimizes the summed loss:

$$F_0(x) = \underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L(y_i, \gamma)$$

To make the calculation easier, substitute $\gamma$ with the probability $p$ and solve for $p$ first.

The probability is then used to compute the residuals.
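As a minimal sketch of this initialization (assuming NumPy; the toy labels are made up for illustration):

```python
import numpy as np

# Toy binary labels; in practice these come from the training set.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

p = y.mean()                    # initial probability = fraction of positive samples
log_odds = np.log(p / (1 - p))  # initial prediction F_0(x) = log(odds)
residuals = y - p               # pseudo residuals used to build the first tree

print(log_odds, residuals)
```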

2. Train a Tree

In practice, people often set the maximum number of leaves to be between 8 and 32.
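For reference, scikit-learn exposes this cap directly (a hedged example, assuming scikit-learn is installed):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Cap each tree at 32 leaves, the upper end of the 8-32 range mentioned above.
model = GradientBoostingClassifier(max_leaf_nodes=32, learning_rate=0.1)
```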

Loss Function

The log(likelihood) of the observed data given the predictions (as used in logistic regression):

$$\sum_{i=1}^{N} y_i \times \log(p_i) + (1 - y_i) \times \log(1 - p_i)$$
The better the prediction, the larger the log(likelihood), so we want to maximize it.

To use the log(likelihood) as a loss function that we want to minimize, simply multiply it by $-1$.

This is a function of $p$; we convert it into a function of $\log(\text{odds})$:

$$-y_i \times \log(\text{odds}) + \log\left(1 + e^{\log(\text{odds})}\right)$$

Take the derivative of the loss function with respect to the predicted $\log(\text{odds})$:

$$-y_i + \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = -y_i + p$$

So this loss function is differentiable.
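A quick numerical check of that derivative (a sketch, assuming NumPy): the analytic gradient $-y_i + p$ should match a finite-difference approximation.

```python
import numpy as np

def loss(y, log_odds):
    # Negative log(likelihood) written as a function of log(odds)
    return -y * log_odds + np.log(1 + np.exp(log_odds))

def gradient(y, log_odds):
    # Analytic derivative: -y + p, where p = sigmoid(log_odds)
    p = 1 / (1 + np.exp(-log_odds))
    return -y + p

y, log_odds, eps = 1.0, 0.3, 1e-6
numeric = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
print(np.isclose(numeric, gradient(y, log_odds)))  # True
```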

Details for building a tree

For $m = 1 \ldots M$:

Calculate residuals for each sample

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = y_i - \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}$$

This is $-1$ times the derivative of the loss function with respect to $\log(\text{odds})$.

$i$ is the sample index and $m$ is the index of the tree we are building.

$F_{m-1}(x)$ is the most recent predicted $\log(\text{odds})$.

After simplification, $y_i - p = \text{Pseudo Residual}$.

Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1 \ldots J_m$.

$J_m$ is the number of leaves.

For $j = 1 \ldots J_m$, compute the output value for each leaf:

$$\gamma_{jm} = \underset{\gamma}{\operatorname{argmin}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$$

In this formula we need to solve for $\gamma$, but taking the first-order derivative and solving for $\gamma$ directly is too hard, so we approximate the loss function with a second-order Taylor polynomial.

If there is only one sample in the leaf:

$$L(y_1, F_{m-1}(x_1) + \gamma) \approx L(y_1, F_{m-1}(x_1)) + \frac{d}{dF} L(y_1, F_{m-1}(x_1))\,\gamma + \frac{1}{2} \frac{d^2}{dF^2} L(y_1, F_{m-1}(x_1))\,\gamma^2$$

Take the first-order derivative with respect to $\gamma$:

$$\frac{d}{d\gamma} L(y_1, F_{m-1}(x_1) + \gamma) \approx \frac{d}{dF} L(y_1, F_{m-1}(x_1)) + \frac{d^2}{dF^2} L(y_1, F_{m-1}(x_1))\,\gamma$$

Set it to zero and solve for $\gamma$:

$$\gamma = \frac{-\frac{d}{dF} L(y_1, F_{m-1}(x_1))}{\frac{d^2}{dF^2} L(y_1, F_{m-1}(x_1))}$$

After simplification, $\gamma_{1,1} = \dfrac{\text{Residual}}{p \times (1 - p)}$.

Update the prediction for each sample: $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$

The output value of each leaf is calculated with the following formula. Since the predictions are in terms of $\log(\text{odds})$, they have to be transformed into probabilities first.

$$\frac{\sum \text{Residual}_i}{\sum \left[\text{Previous Probability}_i \times (1 - \text{Previous Probability}_i)\right]}$$
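A minimal sketch of that leaf computation (assuming NumPy; the residuals and probabilities are hypothetical values for the samples in one leaf):

```python
import numpy as np

# Hypothetical residuals and previous probabilities for one leaf
residuals = np.array([0.3, -0.7, 0.3])
prev_prob = np.array([0.7, 0.7, 0.7])

# Leaf output in log(odds) units: sum of residuals over sum of p * (1 - p)
gamma = residuals.sum() / (prev_prob * (1 - prev_prob)).sum()
print(gamma)
```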

3. Make Predictions

Update the $\log(\text{odds})$ with the learning rate and the output value of the leaf, convert it into a probability, compute the new residuals, and then build the next tree.

Compare the resulting probability with the 0.5 threshold to decide which class the sample belongs to.
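Putting the prediction step together (a sketch under the same assumptions; `leaf_outputs` stands in for the leaf values a single sample lands in, one per tree, and is hypothetical):

```python
import numpy as np

log_odds = 0.0                  # initial prediction F_0(x) for one sample
learning_rate = 0.1             # nu
leaf_outputs = [1.4, 1.1, 0.9]  # hypothetical leaf values from trees 1..M

for gamma in leaf_outputs:
    log_odds += learning_rate * gamma  # F_m(x) = F_{m-1}(x) + nu * gamma

prob = 1 / (1 + np.exp(-log_odds))     # convert log(odds) into a probability
label = int(prob > 0.5)                # compare with the 0.5 threshold
print(prob, label)
```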

Regression

Loss Function

$$\frac{1}{2}(\text{Observed} - \text{Predicted})^2$$

The $\frac{1}{2}$ is there to make the calculation of the derivative easier.
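Concretely, the 2 from the exponent cancels the $\frac{1}{2}$ when differentiating with respect to the prediction, leaving exactly the negative residual:

$$\frac{d}{d\,\text{Predicted}}\; \frac{1}{2}(\text{Observed} - \text{Predicted})^2 = -(\text{Observed} - \text{Predicted})$$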

1. Initialization

The initial prediction is the average target value in the training set:

$$F_0(x) = \underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L(y_i, \gamma)$$

Taking the first-order derivative and setting it to zero gives the average value under the loss function above.

2. Build Trees

For $m = 1, 2 \ldots M$:

Use the learning rate to scale the contribution of each new tree and prevent overfitting: taking lots of small steps in the right direction results in better predictions on a test set (low variance).

Compute the residuals $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for $i = 1, \ldots, n$.

With this loss function, the result is just the ordinary residual $\text{Observed} - \text{Predicted}$; the gradient used here is what Gradient Boost is named after. The pseudo residual $r_{im}$ is the difference between the observed and predicted values, just stated differently. (With another loss function, such as one without the $\frac{1}{2}$, the negative gradient is merely similar to the residual, which is why it is called a pseudo residual.)

Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$.

For $j = 1 \ldots J_m$, compute the output value of each leaf: $\gamma_{jm} = \underset{\gamma}{\operatorname{argmin}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$

The output value of each leaf given this loss function is the average of the residuals on that leaf.

Update: $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$

3. Make Predictions
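Run a sample down every tree, scale each tree's output by the learning rate, and add everything to the initial prediction $F_0(x)$. A minimal end-to-end sketch of the regression algorithm above (assuming NumPy and scikit-learn; `GradientBoost` is a hypothetical name, not a library class):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoost:
    """Minimal gradient boosting regressor with squared-error loss."""

    def __init__(self, n_trees=100, learning_rate=0.1, max_leaf_nodes=8):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_leaf_nodes = max_leaf_nodes

    def fit(self, X, y):
        self.f0 = y.mean()                 # F_0(x): the average target value
        self.trees = []
        pred = np.full(len(y), self.f0)
        for _ in range(self.n_trees):
            residuals = y - pred           # pseudo residuals (negative gradient)
            tree = DecisionTreeRegressor(max_leaf_nodes=self.max_leaf_nodes)
            tree.fit(X, residuals)         # leaf values = average residual per leaf
            pred += self.learning_rate * tree.predict(X)  # F_m = F_{m-1} + nu * gamma
            self.trees.append(tree)
        return self

    def predict(self, X):
        pred = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred

# Usage on tiny synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
model = GradientBoost().fit(X, y)
print(model.predict(X[:3]), y[:3])
```

For squared-error loss, the average residual in a leaf is exactly the $\gamma_{jm}$ that minimizes the per-leaf loss, so fitting a plain regression tree to the residuals gives the correct leaf outputs.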
