10 Boosting and Additive Trees

By Bernice Wagner, 2014-07-25 22:58

10.1 Boosting Methods

Introduction to Boosting

Idea (in a 2-class classification problem):

- Output variable coded as $Y \in \{-1, 1\}$.
- Apply a weak classification algorithm sequentially to repeatedly modified versions of the observations (each observation has a weight that is modified at each iteration).
- Give more weight to the more accurate classifiers $G_m$ in the sequence.
- Modify the weights of the observations for the next step (here, more weight to misclassified observations).

The most popular boosting algorithm is AdaBoost.M1.

Algorithm 10.1: AdaBoost.M1

1. Initialize the observation weights $w_i = 1/N$, $i = 1, \dots, N$.
2. For $m = 1$ to $M$:
   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
   (b) Compute $\mathrm{err}_m = \sum_{i=1}^N w_i\, I(y_i \neq G_m(x_i)) \big/ \sum_{i=1}^N w_i$.
   (c) Compute $\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$.
   (d) Set $w_i \leftarrow w_i \cdot \exp\big[\alpha_m\, I(y_i \neq G_m(x_i))\big]$, $i = 1, \dots, N$.
3. Output $G(x) = \mathrm{sign}\big[\sum_{m=1}^M \alpha_m G_m(x)\big]$.
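Algorithm 10.1 can be sketched in a few lines of Python. This is an illustrative sketch only: the algorithm does not prescribe a particular weak learner, so a decision stump (one-split tree) is assumed here, and the helper names are hypothetical.

```python
# Minimal sketch of Algorithm 10.1 (AdaBoost.M1) with decision stumps as
# the weak classifier G_m. Stump learner and names are illustrative.
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: threshold one feature, predict +/-1."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]  # (feature, threshold, sign)

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_m1(X, y, M):
    N = len(y)
    w = np.full(N, 1.0 / N)              # 1. initialize w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):                   # 2. for m = 1 to M
        stump = fit_stump(X, y, w)       # 2a. fit G_m using weights w
        miss = stump_predict(stump, X) != y
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)  # 2b
        alpha = np.log((1 - err) / err)  # 2c. alpha_m
        w = w * np.exp(alpha * miss)     # 2d. upweight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    def G(Xnew):                         # 3. sign of the weighted vote
        F = sum(a * stump_predict(s, Xnew) for a, s in zip(alphas, stumps))
        return np.sign(F)
    return G
```

Each stump alone is a weak classifier, but the reweighting concentrates later stumps on the hard points, so the weighted vote fits the training data far better than any single stump.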

10.2 Boosting Fits an Additive Model

A. Boosting fits an additive model

$$f(x) = \sum_{m=1}^M \beta_m\, b(x; \gamma_m),$$

where $b(x; \gamma_m) \in \{-1, 1\}$ ($= G_m(x)$ for AdaBoost) plays the role of a set of elementary "basis functions".

This model is fit by minimizing a loss function $L$ averaged over the training data set:

$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\Big(y_i, \sum_{m=1}^M \beta_m\, b(x_i; \gamma_m)\Big).$$

10.3 Forward Stagewise Additive Modeling (FSAM)

B. An approximate solution to this minimization problem is found by FSAM (Algorithm 10.2):

1. Initialize $f_0(x) = 0$.
2. For $m = 1$ to $M$:
   (a) Compute $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$.
   (b) Set $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$.

10.4 Exponential Loss and AdaBoost

C. AdaBoost is equivalent to FSAM using the exponential loss function

$$L(y, f(x)) = \exp(-y f(x)).$$

With this loss,

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N \exp\big[-y_i (f_{m-1}(x_i) + \beta G(x_i))\big] = \arg\min_{\beta, G} \sum_{i=1}^N w_i^{(m)} \exp(-\beta\, y_i G(x_i)),$$

where $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$. The solution is

$$G_m = \arg\min_G \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)), \qquad \beta_m = \tfrac{1}{2} \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m},$$

with

$$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}.$$

10.5 Why Exponential Loss?

The population minimizer of the exponential loss is

$$f^*(x) = \arg\min_{f(x)} E_{Y|x}\big(\exp(-Y f(x))\big) = \frac{1}{2} \log\frac{P(Y=1|x)}{P(Y=-1|x)},$$

or equivalently $P(Y=1|x) = 1/(1 + \exp(-2 f^*(x)))$. Thus AdaBoost is estimating one-half the log-odds of $P(Y=1|x)$. This justifies using its sign as a classification rule.

10.6 Loss Functions and Robustness

A. Exponential loss vs. binomial negative log-likelihood

The deviance (binomial negative log-likelihood) criterion admits the same population minimizer. Interpreting $p(x) = P(Y=1|x)$ as in a logistic model,

$$p(x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2 f(x)}},$$

and with $Y' = (Y+1)/2 \in \{0, 1\}$, the log-likelihood is

$$l(Y, p(x)) = Y' \log p(x) + (1 - Y') \log(1 - p(x)) = -\log\big(1 + \exp(-2 Y f(x))\big).$$

Thus

$$\arg\max_{f(x)} E_{Y|x}\big[l(Y, p(x))\big] = \arg\min_{f(x)} E_{Y|x}\big[\exp(-Y f(x))\big] = f^*(x).$$

N.B.: With $K$-class classification, the response $Y$ takes values in the set $\{G_1, \dots, G_K\}$. Now $G(x) = G_k$ where $k = \arg\max_l p_l(x)$. With the logistic model

$$p_l(x) = \frac{\exp(f_l(x))}{\sum_{l'=1}^K \exp(f_{l'}(x))},$$

the deviance becomes

$$-l(Y, p(x)) = -\sum_{k=1}^K I(Y = G_k)\, f_k(x) + \log\Big(\sum_{l=1}^K \exp(f_l(x))\Big).$$
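The claim that the population minimizer of the exponential loss is one-half the log-odds can be checked numerically. The snippet below is an illustrative check (the grid search and names are not from the text): for a fixed $p = P(Y=1|x)$, it minimizes $E[\exp(-Yf)] = p\,e^{-f} + (1-p)\,e^{f}$ over a grid of $f$ values and compares with $\tfrac{1}{2}\log(p/(1-p))$.

```python
# Numeric check (illustrative) that argmin_f E[exp(-Y f)] at a point x with
# p = P(Y=1|x) is f*(x) = 1/2 * log(p / (1 - p)), one-half the log-odds.
import numpy as np

def expected_exp_loss(f, p):
    # E[exp(-Y f)] = p * exp(-f) + (1 - p) * exp(f)
    return p * np.exp(-f) + (1 - p) * np.exp(f)

def argmin_on_grid(p, grid=np.linspace(-5, 5, 200001)):
    """Grid-search minimizer of the expected exponential loss."""
    return grid[np.argmin(expected_exp_loss(grid, p))]

p = 0.8
f_star = 0.5 * np.log(p / (1 - p))   # closed-form half log-odds
f_grid = argmin_on_grid(p)           # numeric minimizer, matches f_star
```

Inverting $f^*$ also recovers the probability: $1/(1 + \exp(-2 f^*)) = p$, which is the logistic-model link stated above.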

    B. Finite Sample

Deviance is far more robust than exponential loss: the penalty associated with deviance increases only linearly for large negative margins $y f(x)$, whereas the exponential criterion increases the influence of such observations exponentially.

N.B.: Squared-error loss is not a good criterion for classification.

C. Loss functions for regression

- Classification: exponential loss → deviance.
- Regression: squared-error loss → absolute loss:

$$L(y, f(x)) = (y - f(x))^2 \quad \text{vs.} \quad L(y, f(x)) = |y - f(x)|.$$

The respective minimizers are $E(Y|x)$ and $\mathrm{median}(Y|x)$. They are identical for symmetric distributions, but squared-error loss lacks robustness.

A compromise between robustness and efficiency is the Huber loss:

$$L(y, f(x)) = \begin{cases} (y - f(x))^2 & \text{for } |y - f(x)| \leq \delta, \\ 2\delta\big(|y - f(x)| - \delta/2\big) & \text{otherwise.} \end{cases}$$

10.7 "Off-the-Shelf" Procedures for Data Mining

Data mining requires dealing with:

- messy data
- missing values
- long-tailed or highly skewed distributions of the variables
- outliers
- predictor variables measured on different scales
- irrelevant predictor variables

and calls for:

- interpretable models
- computation speed
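The three regression losses above can be sketched directly (an illustrative snippet; the function names are not from the text). Note the Huber loss is quadratic for residuals up to $\delta$ and continues linearly beyond it, with the two pieces matching at $|r| = \delta$.

```python
# Illustrative sketch of the squared, absolute, and Huber losses as
# functions of the residual r = y - f(x); delta is the Huber transition.
import numpy as np

def squared_loss(r):
    return np.asarray(r) ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # Quadratic for small residuals, linear beyond delta: the compromise
    # between efficiency (squared error) and robustness (absolute error).
    a = np.abs(r)
    return np.where(a <= delta, a ** 2, 2 * delta * (a - delta / 2))
```

At $|r| = \delta$ both branches equal $\delta^2$, so the loss is continuous; for large $|r|$ it grows like $2\delta |r|$, which limits the influence of outliers.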


Decision trees come closest to meeting these requirements, but decision trees are not accurate. Solution: boost decision trees, although this sacrifices some of their advantages, such as speed, interpretability, and robustness. MART (multiple additive regression trees) is a generalization of tree boosting that attempts to mitigate these problems.


Boosting trees:

$$f(x) = \sum_{m=1}^M T(x; \Theta_m), \qquad T(x; \Theta) = \sum_{j=1}^J \gamma_j\, I(x \in R_j)$$

→ forward stagewise additive modeling:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^N L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big), \quad \text{with } \Theta_m = \{R_{jm}, \gamma_{jm}\}_1^{J_m}.$$

Given the regions,

$$\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma_{jm}\big).$$

Problem: robust criteria do not give rise to simple fast boosting algorithms.

10.8 Illustration: the Spam Data

$$f(x) = \log\frac{\Pr(\mathrm{spam}\,|\,x)}{\Pr(\mathrm{email}\,|\,x)}$$

Error rates:

- Additive logistic regression: 5.3%
- CART (fully grown and pruned by CV): 8.7%
- MARS: 5.5% (standard error of these estimates: 0.6%)
- $J = 2$ terminal-node boosted tree: 4.6%
- MART: 4%

Running MART with $J = 2$ (a main-effects model) yields a higher error rate than running it with larger $J$ → interaction effects matter.

Partial dependence plots: a one-variable plot shows the dependence of the log-odds on that predictor; a two-variable plot shows interactions among the predictor variables.

10.9 Boosting Trees

- Decision tree: $x \in R_j \Rightarrow f(x) = \gamma_j$.
- Formal expression: $T(x; \Theta) = \sum_{j=1}^J \gamma_j\, I(x \in R_j)$.
- Parameter: $\Theta = \{R_j, \gamma_j\}_1^J$.
- Optimization problem: $\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^J \sum_{x_i \in R_j} L(y_i, \gamma_j)$.
- Approximation:
  - Finding $\gamma_j$ given $R_j$ is trivial: $\hat{\gamma}_j$ is often the mean/mode of the $y_i$ in region $R_j$.
  - Finding the $R_j$ is difficult: the typical way is a greedy, top-down recursive partitioning algorithm; one can also approximate the split criterion by a smoother and more convenient one (10.26).

Boosted trees

- Sum of trees: $f_M(x) = \sum_{m=1}^M T(x; \Theta_m)$.
- Solve using FSAM: $\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^N L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)$.
- Given the regions $R_{jm}$: $\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma_{jm}\big)$.
- The difficult part is finding the $R_{jm}$.

Solving the FSAM problem

- Some special cases are easier:
  - Squared-error loss: find the tree that best predicts the current residuals.
  - Two-class classification with exponential loss: AdaBoost.M1, i.e. the $\{-1, +1\}$-valued tree that minimizes the weighted error rate.
  - $N$-class classification with exponential loss: $\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^N w_i^{(m)} \exp\big[-y_i\, T(x_i; \Theta_m)\big]$; the $\hat{\gamma}_j$ can be found by (10.31), the weighted log-odds in each region $R_j$.
- Robust criteria (absolute error or Huber loss for regression, deviance for classification) would robustify boosting trees, but they do not give rise to simple fast boosting algorithms.

10.10 Numerical Optimization

- Idea: solve each "step" of FSAM by numerical optimization, for any differentiable loss criterion.
- Total loss: $L(f) = \sum_{i=1}^N L(y_i, f(x_i))$.
- Goal: $\hat{f} = \arg\min_f L(f)$, where $f$ is a vector whose "parameters" are the values at each data point: $f = \{f(x_1), f(x_2), \dots, f(x_N)\}$.
- Numerical optimization solves the problem with a sum of component vectors:

$$f_M = \sum_{m=0}^M h_m, \qquad h_0 = \text{initial guess}.$$

10.10.1 Steepest Descent

Greedy strategy: at each step, move in the direction of the negative gradient,

$$g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)},$$

$$\rho_m = \arg\min_\rho L(f_{m-1} - \rho\, g_m), \qquad f_m = f_{m-1} - \rho_m g_m.$$

10.10.2 Gradient Boosting

N.B.: the difference with the previous technique is that the components $t_m = (T(x_1; \Theta_m), \dots, T(x_N; \Theta_m))$ are not independent: they are constrained to be the predictions of a $J_m$-terminal-node decision tree, whereas the negative gradient is the unconstrained maximal descent direction. The ultimate goal is to generalize $f(x)$ to new data points. So at the $m$-th iteration, induce a regression tree whose predictions are as close as possible to the negative gradient:

$$\tilde{\Theta}_m = \arg\min_\Theta \sum_{i=1}^N \big({-g_{im}} - T(x_i; \Theta)\big)^2.$$

Gradients for commonly used loss criteria are given in Table 10.2. For example, with the deviance for $K$-class classification,

$$-g_{im} = I(y_i = G_k) - p_k(x_i), \qquad p_k(x) = \frac{\exp(f_k(x))}{\sum_{l=1}^K \exp(f_l(x))},$$

and $K$ least-squares regression trees are constructed at each iteration.

10.10.3 MART (Multiple Additive Regression Trees)

- Generalization of tree boosting.
- Tries to mitigate the problem of decision trees being less accurate than the best classifier for a particular problem.
- Rationale: minimize loss vs. generalization.

Algorithm 10.3 (gradient tree boosting for MART):

1. Initialize $f_0(x) = \arg\min_\gamma \sum_{i=1}^N L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
   (a) For $i = 1, 2, \dots, N$ compute $r_{im} = -\left[\dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$.
   (b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, \dots, J_m$.
   (c) For $j = 1, 2, \dots, J_m$: $\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big)$.
   (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.
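Algorithm 10.3 can be sketched for the simplest case, squared-error loss, where the negative gradient $r_{im}$ is just the ordinary residual $y_i - f_{m-1}(x_i)$. This is an illustrative sketch: a one-split regression stump stands in for the terminal-region tree, and the names and toy setup are hypothetical.

```python
# Compact sketch of Algorithm 10.3 with squared-error loss: each round fits
# a one-split regression stump to the residuals (the negative gradient).
import numpy as np

def fit_regression_stump(x, r):
    """Best single split on 1-D x minimizing squared error to targets r."""
    best = None
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    for k in range(1, len(xs)):
        thr = 0.5 * (xs[k - 1] + xs[k])
        left, right = rs[:k].mean(), rs[k:].mean()   # gamma_jm per region
        sse = ((rs[:k] - left) ** 2).sum() + ((rs[k:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left, right)
    _, thr, left, right = best
    return lambda xq: np.where(xq <= thr, left, right)

def gradient_boost(x, y, M=50):
    f = np.full(len(y), y.mean())            # 1. f_0 = best constant
    for m in range(M):                       # 2. for m = 1 to M
        r = y - f                            # 2a. negative gradient = residual
        tree = fit_regression_stump(x, r)    # 2b/2c. regions and gamma_jm
        f = f + tree(x)                      # 2d. update on the training points
    return f
```

For squared-error loss, step (c) of the algorithm coincides with the region means computed inside the stump fit, which is why the sketch collapses (b) and (c) into one call.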


10.11 Right-Sized Trees for Boosting

Problem: trees grown separately tend to be much too large (in $J$, the number of terminal nodes) → loss in accuracy and computing time.

- Choose the size $J$ of the tree for each iteration of boosting; a simple strategy is a constant $J$.
- How to find $J$? Minimize prediction risk on future data.
- Interaction effects are limited by $J$: no interaction effects of level greater than $J - 1$ are possible.
  - $J = 2$: decision stump (only main effects, no interactions).
  - $J = 3$: two-variable interaction effects are allowed.
- In fact, the target function $\eta = \arg\min_f E_{X,Y}\, L(Y, f(X))$ can be written in terms of interactions:

$$\eta(X) = \sum_j \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots$$

- Most problems have low-order interaction effects dominating the problem space; thus models with high-order interactions will suffer in accuracy.

Choosing $J$:

- Typically $J = 2$ will be insufficient, and $J > 10$ is highly unlikely to be needed.
- By experience, $4 \leq J \leq 8$ works well in boosting; $J = 6$ is a good initial guess.

10.12 Regularization

- Regularization: prevention of overfitting of the data by the model.
- Example: the parameter $M$.
  - Increasing $M$ reduces the training risk.
  - But it could lead to overfitting.
  - Use a hold-out set to choose $M$, similar to the early-stopping strategy in neural networks.

10.12.1 Shrinkage

- Scale the contribution of each tree by a factor $0 < \nu < 1$:

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm}).$$

- $\nu$ controls the learning rate of the boosting procedure.
- Smaller $\nu$ requires larger $M$, and vice versa.
- Empirically, smaller $\nu$ favors better test error but requires longer training time.
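The shrinkage update can be sketched by inserting the factor $\nu$ into the gradient-boosting update. This is an illustrative snippet (stump base learner, names, and data are hypothetical); it shows the trade-off stated above: with the same $M$, $\nu = 1$ fits the training data faster than $\nu = 0.1$, and a smaller $\nu$ needs a larger $M$ to catch up.

```python
# Illustrative sketch of shrinkage: f_m = f_{m-1} + nu * tree_m, with a
# one-split regression stump fit to the residuals at each step.
import numpy as np

def fit_stump(x, r):
    """Best one-split regression stump for targets r (illustrative helper)."""
    best = None
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    for k in range(1, len(xs)):
        thr = 0.5 * (xs[k - 1] + xs[k])
        left, right = rs[:k].mean(), rs[k:].mean()
        sse = ((rs[:k] - left) ** 2).sum() + ((rs[k:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left, right)
    _, thr, left, right = best
    return lambda xq: np.where(xq <= thr, left, right)

def boost_with_shrinkage(x, y, M, nu):
    """Run M rounds with learning rate nu; return the final training risk."""
    f = np.full(len(y), y.mean())
    for _ in range(M):
        tree = fit_stump(x, y - f)   # fit the residual (negative gradient)
        f = f + nu * tree(x)         # shrunken update
    return ((y - f) ** 2).mean()
```

The slower training-risk decrease under small $\nu$ is exactly what makes shrinkage a regularizer: it leaves room to stop $M$ early, before overfitting.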


- The best strategy is to choose a small $\nu$ (< 0.1) and find $M$ by early stopping.

[Figure: Shrinkage vs. No Shrinkage]

10.12.2 Penalized Regression

- Consider the set $\mathcal{T}$ of all possible $J$-terminal-node regression trees as basis functions, giving the linear model

$$f(x) = \sum_{k=1}^K \alpha_k\, T_k(x),$$

where $K = \mathrm{card}(\mathcal{T})$ is likely to be much larger than any possible training set.

- Thus, penalized least squares is required to find the alphas:

$$\hat{\alpha}(\lambda) = \arg\min_\alpha \Big\{ \sum_{i=1}^N \Big(y_i - \sum_{k=1}^K \alpha_k\, T_k(x_i)\Big)^2 + \lambda\, J(\alpha) \Big\},$$

with penalty function $J(\alpha) = \sum_{k=1}^K \alpha_k^2$ (ridge regression) or $J(\alpha) = \sum_{k=1}^K |\alpha_k|$ (lasso).

- With the lasso, many alphas will be zero when $\lambda$ is large: only a fraction of the possible trees are relevant.
- Problem: we still can't solve over all possible trees.
- Solution: a forward stagewise strategy.
  - Initialize all alphas to 0 first.
  - More iterations correspond to a smaller penalty $\lambda$ (fewer alphas forced to zero).
- The approximation works: tree boosting with shrinkage resembles lasso-penalized regression, whereas no shrinkage is analogous to subset selection (which penalizes the number of non-zero coefficients).
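The forward stagewise strategy can be sketched for a small fixed dictionary of basis functions (here plain standardized columns rather than trees; the setup and names are illustrative): start from $\alpha = 0$ and repeatedly nudge the coefficient of the basis function most correlated with the current residual by a small $\varepsilon$. With tiny $\varepsilon$ this traces a coefficient path similar to the lasso.

```python
# Illustrative sketch of incremental forward stagewise fitting: all alphas
# start at 0 and the most residual-correlated basis function is nudged by
# eps each step, approximating the lasso path.
import numpy as np

def forward_stagewise(X, y, eps=0.01, steps=3000):
    alpha = np.zeros(X.shape[1])
    r = y.astype(float).copy()              # current residual
    for _ in range(steps):
        corr = X.T @ r                      # correlation with the residual
        k = int(np.argmax(np.abs(corr)))    # most correlated basis function
        delta = eps * np.sign(corr[k])
        alpha[k] += delta                   # nudge alpha_k
        r -= delta * X[:, k]                # update the residual
    return alpha
```

Stopping after fewer steps leaves more alphas at exactly zero (a larger implicit penalty); running longer approaches the unpenalized least-squares fit.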

10.12.3 Virtues of the L1 Penalty (Lasso) over L2

- The superior performance of boosting over procedures such as the SVM may be largely due to the implicit use of an L1 rather than an L2 penalty.
- The L1 penalty is better suited to sparse situations (Donoho et al., 1995).
- Though minimization of the L1-penalized problem is much more difficult than for L2, the forward stagewise approach provides an approximate, practical way to tackle the problem.

10.13 Interpretation

- Single decision trees are highly interpretable.
- Linear combinations of trees lose this feature. How to interpret the model then?

10.13.1 Relative Importance of Predictor Variables

- Breiman et al. (1984) proposed a measure of relevance for each predictor variable $X_l$ in a single decision tree $T$:

$$I_l^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2\, I(v(t) = l),$$

where the sum is over the internal nodes $t$, $v(t)$ is the splitting variable at node $t$, and $\hat{i}_t^2$ is the maximal estimated improvement in squared-error risk from that split. Intuition: the most important variable is the one that gives the maximum estimated improvement in squared-error risk.

- For additive tree models, simply average over the trees:

$$I_l^2 = \frac{1}{M} \sum_{m=1}^M I_l^2(T_m).$$

- This also works for $K$-class classifiers (p. 332): $I_{lk}^2 = \frac{1}{M} \sum_{m=1}^M I_l^2(T_{km})$ is the relevance of $X_l$ in separating class $k$ from the other classes, and the overall relevance of $X_l$ is $I_l^2 = \frac{1}{K} \sum_{k=1}^K I_{lk}^2$.

10.13.2 Partial Dependence Plots

- Visualization is a great tool but is limited to low-dimensional views.
- Partial dependence: the marginal average of the model over a subset of the input variables, given the complement of that subset within all input variables.
- Let $X = (X_1, \dots, X_p)$, $S \subset \{1, 2, \dots, p\}$, and let $C$ be the complement, $S \cup C = \{1, 2, \dots, p\}$, so $f(X) = f(X_S, X_C)$. The partial dependence of $f$ on $X_S$ is $f_S(X_S) = E_{X_C}\, f(X_S, X_C)$.
- Works for $K$-class problems as well.
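Estimating $f_S(X_S) = E_{X_C} f(X_S, X_C)$ in practice means averaging the model's predictions over the training-data values of $X_C$. A minimal sketch (the function names and toy model are illustrative):

```python
# Illustrative sketch of a partial dependence estimate: fix the variable in
# S at each grid value and average model predictions over the observed X_C.
import numpy as np

def partial_dependence(model, X_train, s_index, grid):
    """Estimate f_S at each grid value by averaging over training X_C."""
    out = []
    for v in grid:
        Xv = X_train.copy()
        Xv[:, s_index] = v               # fix X_S = v, keep observed X_C
        out.append(model(Xv).mean())     # average over the complement set
    return np.array(out)
```

For an additive model the estimate recovers the component exactly up to a constant: if $f(x) = x_0^2 + 3 x_1$, the partial dependence on $x_0$ is $x_0^2 + 3\,\bar{x}_1$.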
