---
title: "Suggestions for performant supervised learning ensembles"
author: "Chris J. Kennedy"
date: "February 3, 2017"
output: html_document
---
Outline:

* Abstract
* Introduction
* Learners
    * Tier 1:
        * Elastic net
        * Random forest
        * GBM / XGBoost
    * Tier 2:
        * K-nearest neighbors
        * Bayesian additive regression trees
        * Bagging
        * Support vector machines
        * Neural networks
        * Decision trees
        * Multivariate adaptive regression splines
        * GAMs
    * Tier 3:
        * Bayesian regression
        * Polymars
* Other topics:
    * Internal cross-validation vs. SuperLearner ensembling
    * Screening algorithms
    * Review of articles with good/not-as-good libraries
    * High-stability cross-validation
# Summary
# Introduction
# Learners
## Elastic net
This is the best single algorithm and should always be included in a SuperLearner (SL) library. The lasso configuration (alpha = 1) often performs best, but including 5-6 alpha configurations is more thorough (see the sketch after the hyperparameter list).
* Hyperparameters
* Alpha is critical to optimize. It controls the weighting between the ridge (L2) penalty and the lasso (L1) penalty.
* Lambda is optimized automatically by internal cross-validation (e.g. via `cv.glmnet`), so there is no need to tune it explicitly.
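A minimal sketch of such an alpha grid, using the SuperLearner package's `create.Learner()`; the alpha values are illustrative:

```r
# A minimal sketch, assuming the SuperLearner and glmnet packages.
# Each learner fixes a different alpha; lambda is tuned internally by cv.glmnet.
library(SuperLearner)

enet <- create.Learner("SL.glmnet",
                       tune = list(alpha = c(0, 0.2, 0.4, 0.6, 0.8, 1)),
                       detailed_names = TRUE)
enet$names  # vector of generated learner names, one per alpha value
```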
## Random forest
* Hyperparameters
* Feature sampling (mtry)
* Maximum leaf nodes - Breiman thought that the component decision trees should always be grown maximally large, but that turned out to be an artifact of the datasets he used (Segal & Xiao, 2011). So it is helpful to check whether constraining tree size can improve performance.
* Split criterion - can be worth comparing information gain (entropy) to Gini ([see Erin LeDell benchmark post](http://www.wise.io/tech/benchmarking-random-forest-part-1))
* Number of trees - does not need to be explicitly optimized, as RF does not overfit as the number of trees increases. Instead performance reaches a plateau, after which there is no advantage to additional trees (see the tuning sketch below).
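A minimal sketch of a grid over `mtry` and tree size, using SuperLearner's `SL.randomForest` wrapper; the grid values are hypothetical and should scale with the number of features:

```r
# A minimal sketch, assuming the SuperLearner and randomForest packages.
# mtry and maxnodes values are hypothetical; adjust them to ncol(X).
library(SuperLearner)

rf <- create.Learner("SL.randomForest",
                     tune = list(mtry = c(2, 5, 10),
                                 maxnodes = c(16, 64, 256)),
                     detailed_names = TRUE)
rf$names  # 9 learners: the cross-product of the two grids
```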
#### Further reading
* Segal, M., & Xiao, Y. (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 80-87.
## GBM / XGBoost
* Packages
* R - XGBoost is preferable to GBM, as it is faster and more configurable (a tuning sketch follows the hyperparameter list below).
* Python - XGBoost
* Hyperparameters
* Number of trees
* Shrinkage (aka penalization or learning rate)
* Tree depth
* Feature subsampling
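A minimal sketch of such a grid, using SuperLearner's `SL.xgboost` wrapper; the grid values are hypothetical starting points rather than tuned recommendations:

```r
# A minimal sketch, assuming the SuperLearner and xgboost packages.
# ntrees, max_depth, and shrinkage values are hypothetical starting points.
library(SuperLearner)

xgb <- create.Learner("SL.xgboost",
                      tune = list(ntrees = c(200, 500),
                                  max_depth = c(2, 4, 6),
                                  shrinkage = c(0.01, 0.1)),
                      detailed_names = TRUE)
xgb$names  # 12 learners: the cross-product of the three grids
```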
## K-nearest neighbors
* Preprocessing
* Variables need to be standardized (mean 0, sd 1) unless something other than Euclidean distance (the L2 norm) is used as the distance metric (e.g. Mahalanobis, which accounts for scale and covariance itself).
* Hyperparameters
* K - critical to optimize, as there is no reason to expect the default to be best (see the tuning sketch at the end of this section).
* Alternative distance metrics are good to try, e.g. Mahalanobis.
* Alternative weighting kernels are good to try, e.g. triangular rather than uniform (default).
* Feature screening
* Helpful to try, because kNN treats all variables equally.
* Other notes
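A minimal sketch of a k grid with a non-uniform weighting kernel, using SuperLearner's `SL.kernelKnn` wrapper (KernelKnn package); the k values are hypothetical:

```r
# A minimal sketch, assuming the SuperLearner and KernelKnn packages.
# Standardize X beforehand when using the default Euclidean distance.
library(SuperLearner)

knn <- create.Learner("SL.kernelKnn",
                      tune = list(k = c(5, 10, 25, 50)),
                      params = list(weights_function = "triangular"),
                      detailed_names = TRUE)
```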
## Support vector machines
* Packages
* R - kernlab is the best overall package. However, svmpath can efficiently compute the entire regularization path for the cost parameter and is worth considering.
* Preprocessing
* It is important to center and scale the data before using SVM.
* Hyperparameters
* Kernel - radial (RBF, aka Gaussian) is the best initial choice. A polynomial kernel can be worth trying but is not critical. There is no need to try a linear kernel, as it is a special case of the polynomial kernel, unless the number of features is very large.
* Regularization parameter C - critical for establishing the bias-variance trade-off. In the budget formulation (Intro to Statistical Learning, Ch. 9), C is a non-negative budget for margin violations: when C is small the fit has low bias but high variance, and the reverse when C is large. (Note that kernlab and libsvm instead parameterize C as the cost placed on violations, which reverses the direction.) Thorough grid points: $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{15}\}$
* Scale parameter gamma ($\gamma$), aka sigma ($\sigma$) in kernlab, is effectively the inverse bandwidth of the RBF kernel. So a large gamma/sigma corresponds to a narrow bandwidth, meaning that only very nearby observations influence the fit; when gamma/sigma is small, a wider range of points is incorporated. Thorough grid points: $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{3}\}$. Notably, a good initial estimate is generated by kernlab's `sigest` function, which may allow one to skip optimizing this hyperparameter (see the sketch below).
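A minimal sketch of a C grid with the RBF kernel, assuming a SuperLearner version that includes the `SL.ksvm` wrapper around `kernlab::ksvm`:

```r
# A minimal sketch, assuming SuperLearner's SL.ksvm wrapper (kernlab backend).
# kpar = "automatic" lets kernlab estimate sigma via sigest(), so only the
# cost parameter C is tuned; the grid matches the one suggested above.
library(SuperLearner)

svm <- create.Learner("SL.ksvm",
                      tune = list(C = 2^seq(-5, 15, by = 2)),
                      params = list(kernel = "rbfdot", kpar = "automatic"),
                      detailed_names = TRUE)
```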
#### Further reading
* [Cite Hsu et al. 2016 SVM guide]
* Intro to Statistical Learning, Chapter 9
* Elements of Statistical Learning, Chapter 12
* Learning with Kernels
* Modern Multivariate Statistical Techniques, Chapter 11
## Bayesian additive regression trees
* Hyperparameters
* num_trees (number of trees in the sum), k (prior shrinkage of the leaf values), and alpha & beta (tree-depth prior parameters). A tuning sketch follows.
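A minimal sketch of a small BART grid, using SuperLearner's `SL.bartMachine` wrapper (bartMachine package, which requires Java/rJava); the grid values are hypothetical:

```r
# A minimal sketch, assuming the SuperLearner and bartMachine packages.
# num_trees and k values are hypothetical starting points.
library(SuperLearner)

bart <- create.Learner("SL.bartMachine",
                       tune = list(num_trees = c(50, 200),
                                   k = c(2, 3)),
                       detailed_names = TRUE)
```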
#### Further reading
* Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298.
* Chipman, H. A., George, E. I., & McCulloch, R. E. (2007). Bayesian ensemble learning. Advances in Neural Information Processing Systems 19.
* Chipman, H. A., George, E. I., & McCulloch, R. E. (2002). Bayesian treed models. Machine Learning, 48(1-3), 299-320.
* Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935-948.
## Neural networks
* Hyperparameters
* Hidden nodes
## Bagging
* Hyperparameters
* Number of replications
## Decision trees
* Hyperparameters
* Complexity parameter
## Multivariate adaptive regression splines
## Generalized additive models
# Other topics
(To be added)
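In the meantime, a minimal end-to-end sketch of SuperLearner ensembling with the learner grids created in the sections above; the data frame `df` and its binary outcome `y` are hypothetical:

```r
# A minimal sketch, assuming the learner objects (enet, rf, xgb) defined above
# and a hypothetical data frame df with binary outcome y.
library(SuperLearner)

sl_library <- c("SL.mean",   # simple benchmark learner
                enet$names,  # elastic net grid
                rf$names,    # random forest grid
                xgb$names)   # xgboost grid

fit <- SuperLearner(Y = df$y,
                    X = subset(df, select = -y),
                    family = binomial(),
                    SL.library = sl_library,
                    cvControl = list(V = 10))  # 10-fold internal CV
fit  # cross-validated risk and ensemble weight for each learner
```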
# Discussion
# References