---
title: "Suggestions for performant supervised learning ensembles"
author: "Chris J. Kennedy"
date: "February 3, 2017"
output: html_document
---
Outline:

* Abstract
* Introduction
* Learners
    * Tier 1:
        * Elastic net
        * Random forest
        * GBM / XGBoost
    * Tier 2:
        * K-nearest neighbors
        * Bayesian additive regression trees
        * Bagging
        * Support vector machines
        * Neural networks
        * Decision trees
        * Multivariate adaptive regression splines
        * GAMs
    * Tier 3:
        * Bayesian regression
        * Polymars
* Other topics:
    * Internal cross-validation vs. SuperLearner ensembling
    * Screening algorithms
    * Review of articles with good/not-as-good libraries
    * High-stability cross-validation
# Summary
# Introduction
# Learners
## Elastic net
This is the best single algorithm and should always be included in a SuperLearner (SL) library. The lasso configuration (alpha = 1) often performs best, but including 5-6 alpha configurations is more thorough (see the sketch after the hyperparameter list).
* Hyperparameters
* Alpha is critical to optimize. It controls the weighting between the ridge (L2) penalty and the lasso (L1) penalty.
* Lambda is optimized automatically by internal cross-validation (e.g. via `cv.glmnet`), so there is no need to tune it explicitly.
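A minimal sketch of such an alpha grid, using the SuperLearner package's `create.Learner()`; the alpha values are illustrative:

```r
# A minimal sketch, assuming the SuperLearner and glmnet packages.
# Each learner fixes a different alpha; lambda is tuned internally by cv.glmnet.
library(SuperLearner)

enet <- create.Learner("SL.glmnet",
                       tune = list(alpha = c(0, 0.2, 0.4, 0.6, 0.8, 1)),
                       detailed_names = TRUE)
enet$names  # vector of generated learner names, one per alpha value
```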
## Random forest
* Hyperparameters
* Feature sampling (mtry)
* Maximum leaf nodes - Breiman thought that the component decision trees should always be grown maximally large, but that turned out to be an artifact of the datasets he used (Segal & Xiao, 2011). So it is helpful to check whether constraining tree size can improve performance.
* Split criterion - can be worth comparing information gain (entropy) to Gini ([see Erin LeDell benchmark post](http://www.wise.io/tech/benchmarking-random-forest-part-1))
* Number of trees - does not need to be explicitly optimized, as RF does not overfit as the number of trees increases. Instead performance reaches a plateau, after which there is no advantage to additional trees (see the tuning sketch below).
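A minimal sketch of a grid over `mtry` and tree size, using SuperLearner's `SL.randomForest` wrapper; the grid values are hypothetical and should scale with the number of features:

```r
# A minimal sketch, assuming the SuperLearner and randomForest packages.
# mtry and maxnodes values are hypothetical; adjust them to ncol(X).
library(SuperLearner)

rf <- create.Learner("SL.randomForest",
                     tune = list(mtry = c(2, 5, 10),
                                 maxnodes = c(16, 64, 256)),
                     detailed_names = TRUE)
rf$names  # 9 learners: the cross-product of the two grids
```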
#### Further reading
* Segal, M., & Xiao, Y. (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 80-87.
## GBM / XGBoost
* Packages
* R - XGBoost is preferable to GBM, as it is faster and more configurable (a tuning sketch follows the hyperparameter list below).
* Python - XGBoost
* Hyperparameters
* Number of trees
* Shrinkage (aka penalization or learning rate)
* Tree depth
* Feature subsampling
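A minimal sketch of such a grid, using SuperLearner's `SL.xgboost` wrapper; the grid values are hypothetical starting points rather than tuned recommendations:

```r
# A minimal sketch, assuming the SuperLearner and xgboost packages.
# ntrees, max_depth, and shrinkage values are hypothetical starting points.
library(SuperLearner)

xgb <- create.Learner("SL.xgboost",
                      tune = list(ntrees = c(200, 500),
                                  max_depth = c(2, 4, 6),
                                  shrinkage = c(0.01, 0.1)),
                      detailed_names = TRUE)
xgb$names  # 12 learners: the cross-product of the three grids
```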
## K-nearest neighbors
* Preprocessing
* Variables need to be standardized (mean 0, sd 1) unless something other than Euclidean distance (the L2 norm) is used as the distance metric (e.g. Mahalanobis, which accounts for scale and covariance itself).
* Hyperparameters
* K - critical to optimize, as there is no reason to expect the default to be best (see the tuning sketch at the end of this section).
* Alternative distance metrics are good to try, e.g. Mahalanobis.
* Alternative weighting kernels are good to try, e.g. triangular rather than uniform (default).
* Feature screening
* Helpful to try, because kNN treats all variables equally.
* Other notes
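A minimal sketch of a k grid with a non-uniform weighting kernel, using SuperLearner's `SL.kernelKnn` wrapper (KernelKnn package); the k values are hypothetical:

```r
# A minimal sketch, assuming the SuperLearner and KernelKnn packages.
# Standardize X beforehand when using the default Euclidean distance.
library(SuperLearner)

knn <- create.Learner("SL.kernelKnn",
                      tune = list(k = c(5, 10, 25, 50)),
                      params = list(weights_function = "triangular"),
                      detailed_names = TRUE)
```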
## Support vector machines
* Packages
* R - kernlab is the best overall package. However, svmpath can efficiently compute the entire regularization path for the cost parameter and is worth considering.
* Preprocessing
* It is important to center and scale the data before using SVM.
* Hyperparameters
* Kernel - radial (RBF, aka Gaussian) is the best initial choice. A polynomial kernel can be worth trying but is not critical. There is no need to try a linear kernel, as it is a special case of the polynomial kernel, unless the number of features is very large.
* Regularization parameter C - critical for establishing the bias-variance trade-off. In the budget formulation (Intro to Statistical Learning, Ch. 9), C is a non-negative budget for margin violations: when C is small the fit has low bias but high variance, and the reverse when C is large. (Note that kernlab and libsvm instead parameterize C as the cost placed on violations, which reverses the direction.) Thorough grid points: $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{15}\}$
* Scale parameter gamma ($\gamma$), aka sigma ($\sigma$) in kernlab, is effectively the inverse bandwidth of the RBF kernel. So a large gamma/sigma corresponds to a narrow bandwidth, meaning that only very nearby observations influence the fit; when gamma/sigma is small, a wider range of points is incorporated. Thorough grid points: $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{3}\}$. Notably, a good initial estimate is generated by kernlab's `sigest` function, which may allow one to skip optimizing this hyperparameter (see the sketch below).
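A minimal sketch of a C grid with the RBF kernel, assuming a SuperLearner version that includes the `SL.ksvm` wrapper around `kernlab::ksvm`:

```r
# A minimal sketch, assuming SuperLearner's SL.ksvm wrapper (kernlab backend).
# kpar = "automatic" lets kernlab estimate sigma via sigest(), so only the
# cost parameter C is tuned; the grid matches the one suggested above.
library(SuperLearner)

svm <- create.Learner("SL.ksvm",
                      tune = list(C = 2^seq(-5, 15, by = 2)),
                      params = list(kernel = "rbfdot", kpar = "automatic"),
                      detailed_names = TRUE)
```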
#### Further reading
* [Cite Hsu et al. 2016 SVM guide]
* Intro to Statistical Learning, Chapter 9
* Elements of Statistical Learning, Chapter 12
* Learning with Kernels
* Modern Multivariate Statistical Techniques, Chapter 11
## Bayesian additive regression trees
* Hyperparameters
* num_trees (number of trees in the sum), k (prior shrinkage of the leaf values), and alpha & beta (tree-depth prior parameters). A tuning sketch follows.
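A minimal sketch of a small BART grid, using SuperLearner's `SL.bartMachine` wrapper (bartMachine package, which requires Java/rJava); the grid values are hypothetical:

```r
# A minimal sketch, assuming the SuperLearner and bartMachine packages.
# num_trees and k values are hypothetical starting points.
library(SuperLearner)

bart <- create.Learner("SL.bartMachine",
                       tune = list(num_trees = c(50, 200),
                                   k = c(2, 3)),
                       detailed_names = TRUE)
```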
#### Further reading
* Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298.
* Chipman, H. A., George, E. I., & McCulloch, R. E. (2007). Bayesian ensemble learning. Advances in Neural Information Processing Systems 19.
* Chipman, H. A., George, E. I., & McCulloch, R. E. (2002). Bayesian treed models. Machine Learning, 48(1-3), 299-320.
* Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935-948.
## Neural networks
* Hyperparameters
* Hidden nodes
## Bagging
* Hyperparameters
* Number of replications
## Decision trees
* Hyperparameters
* Complexity parameter
## Multivariate adaptive regression splines
## Generalized additive models
# Other topics
(To be added)
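In the meantime, a minimal end-to-end sketch of SuperLearner ensembling with the learner grids created in the sections above; the data frame `df` and its binary outcome `y` are hypothetical:

```r
# A minimal sketch, assuming the learner objects (enet, rf, xgb) defined above
# and a hypothetical data frame df with binary outcome y.
library(SuperLearner)

sl_library <- c("SL.mean",   # simple benchmark learner
                enet$names,  # elastic net grid
                rf$names,    # random forest grid
                xgb$names)   # xgboost grid

fit <- SuperLearner(Y = df$y,
                    X = subset(df, select = -y),
                    family = binomial(),
                    SL.library = sl_library,
                    cvControl = list(V = 10))  # 10-fold internal CV
fit  # cross-validated risk and ensemble weight for each learner
```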
# Discussion
# References