lecture-20-ensembles

class: inverse, left, bottom

# Biological Statistics II
### Gavin Fay   
<br> __2025-04-15: Lecture 20, Ensemble methods__
<br> [gfay@umassd.edu](gfay@umassd.edu) 
<br> __Acknowledgements:__  [Alison Hill]("https://twitter.com/apreshill")

---

## Ensemble methods

What are ensembles?

_Ensembles_ 
Predictions based on some aggregation of multiple models to produce improved results.  
Idea that can obtain more accurate predictions than a single model would.

![](figs/hurricane_ensemble.gif)

---

![](figs/Hill_ML_map.png)

---

![](figs/4typesml.jpg)

.footnote[
img from Machine Learning for Everyone <https://vas3k.com/blog/machine_learning/>
]

---

## Approaches

### Voting

* Majority voting  
* Weighted voting

### Averaging

* Simple averaging  
* Weighted averaging

We've looked at model averaging via AIC.  
i.e. AIC weights  
- only for models that can be compared via AIC

Here we are combining predictions.

---

## Ensemble methods

### Stacking

### Boosting

### Bagging

---

## Stacking - combining by another model

![](figs/stacking.png)

.footnote[
[img from Machine Learning for Everyone]<https://vas3k.com/blog/machine_learning/>
]

---

## Boosting

"convert weak models to strong ones"

Incrementally build an ensemble by training each model with the same dataset but where the weights of instances are adjusted according to the error of the last prediction.

e.g. model 2 is fit to the residuals from model 1, etc.

- forces models to focus on observations that are poorly fit.  
 
Boosting is sequential.

---

## Boosting

![](figs/boosting.png)

.footnote[
[img from Machine Learning for Everyone]<https://vas3k.com/blog/machine_learning/>
]

---

## Boostrap Aggregating (Bagging)

.pull-left[
Obtain multiple model fits based on boostrapped datasets.

Recall Bootstrapped dataset:
Dataset of size equal to original, but resampling with replacement.

Fit (train) model to each data set.

Obtain model predictions for each boostrap.

Combine predictions.

We've done this with LMs and GLMs, but we can also do this with trees.  
]

.pull-right[
![](figs/bagbaby_yeah.jpg)
]

---

## Bagging

![](figs/bagging.png)

.footnote[
[img from Machine Learning for Everyone]<https://vas3k.com/blog/machine_learning/>
]

---

## Bagging with Classification Trees

.footnote[
Altman, N., Krzywinski, M. Ensemble methods: bagging and random forests. Nat Methods 14, 933–934 (2017). [https://doi.org/10.1038/nmeth.4438](https://doi.org/10.1038/nmeth.4438)
]

---

## Random Forest

Random Forest algorithm uses the bagging technique with some differences.

Random Forest uses random feature selection, and the base algorithm of it is a decision tree algorithm.

Bagging results in tree correlation that limits effects of variance reduction

Random forests try to reduce tree correlation by adding more randomness into the tree-growing process.

random forests perform split-variable randomization where each time a split is to be performed, the search for the split variable is limited to a random subset `mtry` of the original p features. Typical default values are mtry=p/3 (regression) and mtry=√p (classification) but this should be considered a tuning parameter.

Random forests have become popular because good out-of-box performance.

---

## Example, regression modeling of beach species richness

Workflow:

![](figs/Hill_workflow_test.png)

---

## Example, regression modeling of beach species richness

2 models:
linear regression
regression tree

``` r
RIKZ <- read_table2(file = "../data/RIKZ.txt")
```

```
hide Warning: `read_table2()` was deprecated in readr 2.0.0.
hide ℹ Please use `read_table()` instead.
hide This warning is displayed once every 8 hours.
hide Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
hide generated.
```

``` r
RIKZ <- RIKZ %>% mutate(Richness = rowSums(RIKZ[,2:76]>0)) %>% 
  select(Richness, 77:89) %>% 
  mutate(week = as.factor(week),
         Beach = as.factor(Beach))
```

``` r
library(tidymodels)

set.seed(8675309)
data_split <- initial_split(RIKZ, prop = 0.5)

rikz_train <- training(data_split)
rikz_test  <- testing(data_split)
```

---

background-image: url(figs/parsnip_logo.png)
background-position: 50% 80%
background-size: 20%

## linear model

We use the `parsnip` package from `tidymodels`.   
An advantage is we can fit lots of different types of models with the same workflow.

``` r
lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Richness ~ ., data = rikz_train)
```

---

## linear model

We use the `parsnip` package from `tidymodels`.   
An advantage is we can fit lots of different types of models with the same workflow.

``` r
lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Richness ~ ., data = rikz_train)
lm_fit
```

```
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Richness ~ ., data = data)
## 
## Coefficients:
##   (Intercept)          week2          week3          week4         angle1  
##    -76.665849      -5.865260      -4.306872      -8.938132      -0.037125  
##        angle2       exposure       salinity    temperature            NAP  
##      0.020102       1.250073       1.991103       0.902894      -2.684827  
## penetrability      grainsize          humus          chalk       sorting1  
##      0.001264      -0.001016     -11.874936      -0.248567       0.042284  
##        Beach2         Beach3         Beach4         Beach5         Beach6  
##      3.237863             NA             NA             NA             NA  
##        Beach7         Beach8         Beach9  
##            NA             NA             NA
```

---

## linear model

``` r
lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Richness ~ ., data = rikz_train)
```

.pull-left[

``` r
test_results <- 
  rikz_test %>%
  select(Richness) %>%
  bind_cols(
    predict(lm_fit, new_data = rikz_test)
  )
test_results %>% slice(1:5)
```

```
## # A tibble: 5 × 2
##   Richness .pred
##      <dbl> <dbl>
## 1       11 12.7 
## 2       10 14.4 
## 3        9  9.97
## 4        8 13.5 
## 5       19 15.2
```
]

.pull-right[

``` r
# summarize performance
test_results %>% metrics(truth = Richness, estimate = .pred) 
```

```
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.62 
## 2 rsq     standard       0.512
## 3 mae     standard       3.16
```
]

---

## Decision Tree

.pull-left[

``` r
rt_fit <- decision_tree(mode = "regression") %>% 
  set_engine("rpart") %>%
  fit(Richness ~ ., data = rikz_train)
rt_fit
```

```
## parsnip model object
## 
## n= 22 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 22 334.3636 5.727273  
##   2) grainsize>=248 15  38.4000 3.800000 *
##   3) grainsize< 248 7 120.8571 9.857143 *
```
]

.pull-right[

``` r
library(rpart.plot)
rpart.plot(rt_fit$fit)
```

![](MAR536_Lecture_20_ensembles_files/figure-html/unnamed-chunk-10-1.png)
]

---

## Decision Tree

``` r
test_results <- 
  test_results %>%
  rename(`linear model` = .pred) %>%
  bind_cols(
    predict(rt_fit, new_data = rikz_test) %>% 
      rename(`decision tree` = .pred)
  )
test_results %>% slice(1:3)
```

```
## # A tibble: 3 × 3
##   Richness `linear model` `decision tree`
##      <dbl>          <dbl>           <dbl>
## 1       11          12.7             9.86
## 2       10          14.4             9.86
## 3        9           9.97            9.86
```

``` r
# summarize performance
test_results %>% metrics(truth = Richness, estimate = `decision tree`) 
```

```
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      5.81  
## 2 rsq     standard      0.0704
## 3 mae     standard      3.96
```

---

## Plotting predictions
<img src="MAR536_Lecture_20_ensembles_files/figure-html/unnamed-chunk-13-1.png" width="100%" />

---

## Bagging

``` r
rf_defaults <- rand_forest(mode = "regression", mtry = .preds())
rf_defaults
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = .preds()
## 
## Computational engine: ranger
```

---

## Bagging

``` r
rf_defaults <- rand_forest(mode = "regression", mtry = .preds())
bg_xy_fit <- 
  rf_defaults %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(Richness ~ ., data = rikz_train)
bg_xy_fit
```

```
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~.preds(),      x), importance = ~"impurity", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      22 
## Number of independent variables:  13 
## Mtry:                             13 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7.544348 
## R squared (OOB):                  0.5261707
```

---

## Bagging, predictions

``` r
test_results <- 
  test_results %>%
  bind_cols(
    predict(bg_xy_fit, new_data = rikz_test) %>% 
      rename(`bagging` = .pred)
  )
test_results %>% slice(1:3)
```

```
## # A tibble: 3 × 4
##   Richness `linear model` `decision tree` bagging
##      <dbl>          <dbl>           <dbl>   <dbl>
## 1       11          12.7             9.86    8.81
## 2       10          14.4             9.86   10.2 
## 3        9           9.97            9.86   10.6
```

``` r
# summarize performance
test_results %>% metrics(truth = Richness, estimate = bagging) 
```

```
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.64 
## 2 rsq     standard       0.380
## 3 mae     standard       2.94
```

---

## Random Forest

``` r
rf_xy_fit <- 
  rand_forest(mode = "regression") %>% 
  set_engine("ranger", importance = "impurity") %>%
  fit(Richness ~ ., data = rikz_train)
rf_xy_fit
```

```
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      22 
## Number of independent variables:  13 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7.326581 
## R squared (OOB):                  0.5398477
```

---

## Random Forest

``` r
test_results <- 
  test_results %>%
  bind_cols(
    predict(rf_xy_fit, new_data = rikz_test) %>% 
      rename(`random forest` = .pred))
test_results %>% slice(1:3)
```

```
## # A tibble: 3 × 5
##   Richness `linear model` `decision tree` bagging `random forest`
##      <dbl>          <dbl>           <dbl>   <dbl>           <dbl>
## 1       11          12.7             9.86    8.81            8.66
## 2       10          14.4             9.86   10.2            10.2 
## 3        9           9.97            9.86   10.6            10.6
```

``` r
# summarize performance
test_results %>% metrics(truth = Richness, estimate = `random forest`) 
```

```
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.77 
## 2 rsq     standard       0.342
## 3 mae     standard       3.13
```

---

## plotting predictions
<img src="MAR536_Lecture_20_ensembles_files/figure-html/unnamed-chunk-21-1.png" width="80%" />

---

## Importance of predictor variables

We see the effect of the random forest - de-emphasizes NAP in tree solutions.

```
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

![](MAR536_Lecture_20_ensembles_files/figure-html/unnamed-chunk-22-1.png)

---

## Takeaways

Ensemble methods rely on 'wisdom of the crowd', to (often) produce more accurate predictions with lower variance.

Rather than averaging or using majority vote, Stacking uses predictions from models as input to another model to produce the ensemble estimate.

Boosting iterates to try and improve fit for poorly fit observations.

Bagging is akin to bootstrap methods seen earlier in class.

'Out-of-bag' error rates (OOB) give an estimate of the test error rate.

Random Forest extends bagging.

These methods all have 'tuning' parameters that in many cases can be optimized.

Ensemble methods can produce accurate predictions but are often much less interpretable than single statistical models.

---

## Resources

Lots of learning resources on the new [tidymodels]("https://tidymodels.org") website!

Also lots of resources collated by [Alison Hill]("https://twitter.com/apreshill"):  
[https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/](https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/)

Altman, N., Krzywinski, M. Ensemble methods: bagging and random forests. Nat Methods 14, 933–934 (2017). [https://doi.org/10.1038/nmeth.4438]("https://doi.org/10.1038/nmeth.4438")

R for Statistical Learning (D. Dalpiaz) chap 27  
[https://daviddalpiaz.github.io/r4sl/ensemble-methods.html]("https://daviddalpiaz.github.io/r4sl/ensemble-methods.html")