Boosted Regression Tree SDM explained

Modified on Wed, 22 Feb, 2023 at 3:55 PM

Introduction

Boosted Regression Tree (BRT) models combine two techniques: decision tree algorithms and boosting methods. Like Random Forest models, BRTs fit many decision trees and combine them to improve the accuracy of the model. One of the differences between the two methods is the way in which the data used to build each tree are selected. Both techniques take a random subset of the data for each new tree. All random subsets have the same number of data points and are drawn from the complete dataset; used data are placed back in the full dataset and can be selected again for subsequent trees.

Random Forest models use the bagging method, in which each occurrence has an equal probability of being selected in every sample. BRTs use the boosting method, in which the input data are weighted in subsequent trees. The weights are applied so that data that were poorly modelled by previous trees have a higher probability of being selected for the new tree. After the first tree is fitted, the model takes the prediction error of that tree into account when fitting the next tree, and so on. By accounting for the fit of the trees already built, the model continuously tries to improve its accuracy. This sequential dependence between trees is what distinguishes boosting from bagging.
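The difference between the two sampling schemes described above can be sketched in a few lines of Python. This is only an illustration, not the code used by Biosecurity Commons: the function names, the toy data and the choice of weighting each point by its absolute prediction error are all assumptions made for the example.

```python
import random


def draw_subset(n_points, subset_size, weights=None, rng=None):
    """Draw indices with replacement.

    weights=None gives every point the same probability (bagging-style);
    passing weights makes some points more likely to be drawn (boosting-style).
    """
    rng = rng or random.Random()
    return rng.choices(range(n_points), weights=weights, k=subset_size)


def error_weights(observed, predicted):
    """Hypothetical weighting: absolute error of the previous trees,
    plus a small floor so that no point ends up with zero probability."""
    return [abs(o - p) + 1e-6 for o, p in zip(observed, predicted)]


rng = random.Random(123)

# Toy presence/absence data and made-up predictions from "previous trees".
# Points 2 and 5 are the poorly modelled ones.
observed = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.1, 0.2, 0.8, 0.1, 0.7, 0.9, 0.2]

# Bagging: every occurrence is equally likely to be selected.
bagging_sample = draw_subset(len(observed), 6, rng=rng)

# Boosting: poorly modelled occurrences are more likely to be selected.
weights = error_weights(observed, predicted)
boosting_sample = draw_subset(len(observed), 6, weights=weights, rng=rng)

print("bagging sample: ", bagging_sample)
print("boosting sample:", boosting_sample)
```

With these weights, points 2 and 5 carry roughly two thirds of the total selection probability, so they tend to dominate the boosting-style sample.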

Boosted Regression Trees have two important parameters that need to be specified by the user.

Tree complexity (tc): this controls the number of splits in each tree. A tc value of 1 results in trees with only 1 split, which means that the model does not take into account interactions between environmental variables. A tc value of 2 results in two splits, and so on.
Learning rate (lr): this determines the contribution of each tree to the growing model. A small value of lr means that many trees need to be built.

These two parameters together determine the number of trees that is required for optimal prediction. The aim is to find the combination of parameters that results in the minimum error for predictions. As a rule of thumb, it is advised to use a combination of tree complexity and learning rate values that result in a model with at least 1000 trees. The optimal ‘tc’ and ‘lr’ values depend on the size of your dataset. For datasets with <500 occurrence points, it is best to model simple trees (‘tc’ = 2 or 3) with small enough learning rates to allow the model to grow at least 1000 trees.
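To see why a smaller learning rate requires more trees, the following minimal sketch boosts single-split trees (tc = 1) on a toy dataset and counts how many trees are needed to reach the same error. It is a simplified illustration under several assumptions: it uses squared error rather than the deviance of a bernoulli model, it uses no random subsampling, and the helper names fit_stump and trees_needed and the data are invented for the example.

```python
def fit_stump(x, residuals):
    """Fit a single-split tree (tc = 1) to the residuals by minimising squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def trees_needed(x, y, learning_rate, target_mse, max_trees=10000):
    """Add trees one at a time until the mean squared error drops to target_mse."""
    prediction = [sum(y) / len(y)] * len(y)  # start from the mean response
    for n_trees in range(1, max_trees + 1):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        # Each tree contributes only a fraction (the learning rate) of its fit.
        prediction = [pi + learning_rate * stump(xi) for xi, pi in zip(x, prediction)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y)
        if mse <= target_mse:
            return n_trees
    return max_trees


# Toy data: one environmental variable and a response that switches from 0 to 1.
x = list(range(20))
y = [0.0] * 10 + [1.0] * 10

fast_trees = trees_needed(x, y, learning_rate=0.1, target_mse=0.01)
slow_trees = trees_needed(x, y, learning_rate=0.01, target_mse=0.01)
print("trees needed with lr = 0.1: ", fast_trees)
print("trees needed with lr = 0.01:", slow_trees)
```

On this toy example the smaller learning rate needs many more trees to reach the same error, which is the trade-off to keep in mind when choosing values that allow the model to grow at least 1000 trees.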

Boosted Regression Trees are a powerful algorithm and work very well with large datasets or when you have a large number of environmental variables compared to the number of observations, and they are very robust to missing values and outliers.

Advantages

Can be used with a variety of response types (binomial, gaussian, poisson)
Like all tree models, accounts for interactions between independent variables exceptionally well
Stochastic, which improves predictive performance
The best fit is automatically detected by the algorithm
Robust to missing values and outliers

Limitations

Needs at least 2 predictor variables to run
May overfit, especially if the representativeness of the absence data does not match that of the presence data

Assumptions

No formal distributional assumptions. Boosted regression trees are non-parametric and can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.

Requires absence data

Yes

Configuration options

Biosecurity Commons allows the user to set model arguments as specified below.

random_seed	Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation.
tree_complexity	Sets the number of splits in individual trees. BRT automatically accounts for interactions between variables if the tree complexity is set to the number of expected interactions between variables. A value of 2 or 3 is a good choice if you think there are interactions between variables. (default = 1)
Learning_rate	Sets the weight applied to individual trees. If this value is too high, your model may miss the signal. If it is too low, your model may take a very long time to run, with little gain for the extra hours of run time. (default = 0.01)
Bag_fraction	Fraction or proportion of the data (observations) that are randomly selected to build the next tree in the model. (default = 0.75)
Number of cross validation sub-sets (n_folds)	The number of subsets used for cross-validation. In each round, all but one subset (n_folds - 1) are used for training and the remaining subset is used for testing. Remember that if your sample size is small, you may not have enough data to generate 10 sensible subsets. (default = 10)
Prevalence stratify (prev_stratify)	Whether subsets should be stratified. For binomial data, each subset will thus contain roughly the same proportion of each data class, for example presence/absence. (default = TRUE)
Family (family)	Distribution of the response variable. (default = bernoulli)
Number of trees added each cycle (n.trees)	Number of initial trees to fit, and the number of trees then added to the model at each cycle. For example, with the default of 50, the model starts by fitting 50 trees using recursive binary partitioning of the data. The residuals from this initial fit are then fitted with another set of 50 trees, the residuals from that fit with a further set, and so forth, so that the process focuses more and more on occurrences that were poorly modelled by previous sets of trees. (default = 50)
Maximum number of trees (max_trees)	Maximum number of trees to fit before stopping. (default = 10000)
Tolerance method (tolerance_method)	Method used to decide when to stop adding trees. If this is set to 'fixed', the value indicated in 'tolerance value' is used. If this is set to 'auto', the value is 'tolerance value*total mean deviance'. (default = auto)
Tolerance value (tolerance_value)	Value to use in 'tolerance method'. (default = 0.001)
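
The way the tolerance settings, the number of trees added each cycle and the maximum number of trees interact can be sketched as follows. The threshold calculation follows the description of 'fixed' and 'auto' above. The stopping loop is a simplified assumption (stop once the improvement in deviance between two consecutive cycles falls below the threshold) and may differ from the rule actually used by the platform. The names stopping_threshold, grow_in_cycles and deviance_after, and the toy deviance curve, are invented for the example.

```python
def stopping_threshold(tolerance_method, tolerance_value, total_mean_deviance):
    """Return the threshold used to decide when to stop adding trees."""
    if tolerance_method == "fixed":
        return tolerance_value
    if tolerance_method == "auto":
        return tolerance_value * total_mean_deviance
    raise ValueError("tolerance_method must be 'fixed' or 'auto'")


def grow_in_cycles(deviance_after, n_trees_per_cycle=50, max_trees=10000, threshold=0.001):
    """Add trees in cycles until the improvement is below threshold or max_trees is reached.

    deviance_after is a hypothetical callable that returns the holdout
    deviance of a model with the given number of trees.
    """
    n_trees = n_trees_per_cycle
    previous = deviance_after(n_trees)
    while n_trees + n_trees_per_cycle <= max_trees:
        n_trees += n_trees_per_cycle
        current = deviance_after(n_trees)
        if previous - current < threshold:  # assumed stopping rule, see text
            break
        previous = current
    return n_trees


def toy_deviance(n_trees):
    """Made-up deviance curve that flattens out as trees are added."""
    return 1.0 + 100.0 / n_trees


# Defaults from the table above; a total mean deviance of 1.2 is a made-up value.
threshold = stopping_threshold("auto", tolerance_value=0.001, total_mean_deviance=1.2)
final_trees = grow_in_cycles(toy_deviance, n_trees_per_cycle=50,
                             max_trees=10000, threshold=threshold)
print("stopping threshold:", threshold)
print("trees fitted:", final_trees)
```

With 'auto', the threshold scales with the total mean deviance of the data, so the same tolerance value behaves consistently across datasets whose deviance differs in magnitude.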

Additional information

Methods to “tune” BRT models have resulted in models outperforming many alternative kinds of models. These steps are not yet available in the Biosecurity Commons dashboard tools, but can be found at the links below.

https://rspatial.org/raster/sdm/9_sdm_brt.html

Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802–813.

Simple examples are also in the EcoCommons step 3 SDM in R module, but the data in step 3 is built from steps 1 & 2.

References

De’ath, G. (2007). Boosted Trees for Ecological Modeling and Prediction. Ecology, 88(1), 243–251.
Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802–813.
Franklin, J. (2010). Mapping Species Distributions: Spatial Inference and Prediction. Cambridge University Press.

Additional Reading

Harris, D. J. (2015). Generating realistic assemblages with a joint species distribution model. Methods in Ecology and Evolution, 6(4), 465–473.
Jarnevich, C. S., & Young, N. E. (2019). Not so Normal Normals: Species Distribution Model Results are Sensitive to Choice of Climate Normals and Model Type. Climate, 7(3), 37.
La Marca, W., Elith, J., Firth, R. S. C., Murphy, B. P., Regan, T. J., Woinarski, J. C. Z., & Nicholson, E. (2019). The influence of data source and species distribution modelling method on spatial conservation priorities. Diversity and Distributions, 25(7), 1060–1073.
Mateo, R. G., Gastón, A., Aroca-Fernández, M. J., Saura, S., & García-Viñas, J. I. (2018). Optimization of forest sampling strategies for woody plant species distribution modelling at the landscape scale. Forest Ecology and Management, 410, 104–113.
Niamir, A., Skidmore, A. K., Muñoz, A.-R., Toxopeus, A. G., & Real, R. (2019). Incorporating knowledge uncertainty into species distribution modelling. Biodiversity and Conservation, 28(3), 571–588.
Oyafuso, Z. S., Drazen, J. C., Moore, C. H., & Franklin, E. C. (2017). Habitat-based species distribution modelling of the Hawaiian deepwater snapper-grouper complex. Fisheries Research, 195, 19–27.
Rodríguez-Rey, M., Consuegra, S., Börger, L., & Leaniz, C. G. de. (2019). Improving Species Distribution Modelling of freshwater invasive species for management applications. PLOS ONE, 14(6), e0217896.
Wang, H.-H., Wonkka, C. L., Treglia, M. L., Grant, W. E., Smeins, F. E., & Rogers, W. E. (2015). Species distribution modelling for conservation of an endangered endemic orchid. AoB PLANTS, 7, plv039.