Random Forest SDM explained

Modified on Thu, 23 Feb, 2023 at 12:15 PM

Introduction

Random Forests are an extension of single Classification Trees in which multiple decision trees are built from random subsets of the data. The subsets are selected in a procedure called ‘bagging’ (bootstrap aggregating): each subset is drawn from the complete dataset with replacement, so a data point that has been used is placed back and can be selected again, both within the same subset and in subsequent trees. All subsets have the same number of data points, and each data point has an equal probability of being selected on every draw. On average, about two thirds of the distinct data points in the dataset end up in each subset. The remaining third is not used to build that tree and is called the ‘out-of-bag’ data; it is later used to evaluate the model.
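The bagging step described above can be illustrated with a small simulation. The sketch below is not the platform's implementation; the function names and the sizes used (1000 data points, 200 trees) are assumptions chosen for illustration. It draws bootstrap samples with replacement and reports the average share of distinct data points that land in each sample, which should come out close to two thirds.

```python
import random


def bootstrap_sample(n_points, rng):
    """Draw n_points indices with replacement.

    Returns the in-bag indices (with repeats) and the sorted
    out-of-bag indices that were never drawn.
    """
    in_bag = [rng.randrange(n_points) for _ in range(n_points)]
    out_of_bag = sorted(set(range(n_points)) - set(in_bag))
    return in_bag, out_of_bag


def mean_in_bag_fraction(n_points=1000, n_trees=200, seed=123):
    """Average share of distinct data points included per bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        in_bag, _ = bootstrap_sample(n_points, rng)
        total += len(set(in_bag)) / n_points
    return total / n_trees


if __name__ == "__main__":
    # Expect a value near 0.63; the rest of each sample is out-of-bag.
    print(round(mean_in_bag_fraction(), 3))
```

Each tree would be trained on its own in-bag indices and evaluated on its own out-of-bag indices.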

Advantages

Often among the most accurate learning algorithms available
Can handle many predictor variables
Provides estimates of the importance of different predictor variables
Can maintain accuracy even when a large proportion of the data is missing
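To make the idea of variable importance more concrete, the sketch below shows a generic permutation approach: shuffle one predictor at a time and measure how much the model's accuracy drops. This is an illustration only and may differ from the score the platform computes. The toy data, the `toy_model` rule and all function names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the observed label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_features,
                           n_permutations=5, seed=123):
    """Mean drop in accuracy after shuffling each predictor in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_permutations):
            # Shuffle column j only, leaving the other predictors intact.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_permutations)
    return importances


# Hypothetical toy data: presence depends on temperature only.
_rng = random.Random(0)
rows = [(_rng.uniform(0, 40), _rng.uniform(0, 1)) for _ in range(500)]
labels = [int(temperature > 20) for temperature, _ in rows]


def toy_model(row):
    """Stand-in for a fitted model: predicts presence above 20 degrees."""
    return int(row[0] > 20)


if __name__ == "__main__":
    print(permutation_importance(toy_model, rows, labels, n_features=2))
```

In this toy case the first predictor receives a large importance score and the second, which the model ignores, receives zero.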

Limitations

Can overfit datasets that are particularly noisy
For data that include categorical predictor variables with different numbers of levels, random forests are biased in favor of the predictors with more levels. The variable importance scores from random forests are therefore not always reliable for this type of data

Assumptions

No formal distributional assumptions. Random forests are non-parametric and can therefore handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.

Requires absence data

Yes.

Configuration options

Biosecurity Commons allows the user to set model arguments as specified below.

Argument	Default value	Argument description
Random seed	NULL	Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation.
Number of repetitions (nb_run_eval)	10	Integer value, corresponding to the number of repetitions to be done for calibration/validation splitting.
Data split percentage (data_split)	100	Numeric value between 0 and 100, corresponding to the percentage of data used to calibrate the models (calibration/validation splitting).
Weighted response weights (Prevalence)	NULL	Allows the user to give more or less weight to particular observations. By default, each observation (presence or absence) has the same weight. If value < 0.5: absences are given more weight; if value > 0.5: presences are given more weight.
Variable importance (var_import)	0	Integer value, corresponding to the number of permutations to be done for each variable to estimate variable importance.
Scale models (rescale_all_models)	FALSE	A logical value defining whether all model predictions should be scaled with a binomial GLM or not.
Evaluate all models (do_full_models)	TRUE	A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not.
Do classification (do.classif)	TRUE	If TRUE, a classification random forest is computed; otherwise, a regression random forest is computed.
Number of trees (ntree)	500	Total number of trees to grow.
Select variables (mtry)	'default'	Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3).
Minimum node size (nodesize)	1	The minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time).
Maximum node size (maxnodes)	NULL	The maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than the maximum possible, a warning is issued.
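Two of the arguments above can be illustrated with a short sketch. The first function computes the default number of candidate variables per split from the number of predictors p; rounding down and the lower bound of 1 are assumptions based on the documented behaviour of the R randomForest package. The second function shows how a seed, a data split percentage and a number of repetitions could combine into reproducible calibration/validation splits. It uses a simple random split, which is not necessarily how the platform partitions the data, and both function names are hypothetical.

```python
import math
import random


def default_mtry(n_predictors, classification=True):
    """Default number of variables tried at each split.

    Assumes floor(sqrt(p)) for classification and
    max(floor(p / 3), 1) for regression.
    """
    if classification:
        return max(math.floor(math.sqrt(n_predictors)), 1)
    return max(n_predictors // 3, 1)


def calibration_validation_splits(n_obs, data_split=70, nb_run_eval=10, seed=123):
    """Return nb_run_eval (calibration, validation) index pairs.

    data_split is the percentage of observations used for calibration.
    The same seed always yields the same splits.
    """
    rng = random.Random(seed)
    n_calib = round(n_obs * data_split / 100)
    splits = []
    for _ in range(nb_run_eval):
        shuffled = rng.sample(range(n_obs), n_obs)
        splits.append((sorted(shuffled[:n_calib]), sorted(shuffled[n_calib:])))
    return splits


if __name__ == "__main__":
    print(default_mtry(9), default_mtry(9, classification=False))
    first_calibration, first_validation = calibration_validation_splits(20, 80, 2)[0]
    print(len(first_calibration), len(first_validation))
```

With a data split of 100, as in the default above, every observation is used for calibration and the validation set is empty.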

References

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
Cutler, D. R., Edwards Jr., T. C., Beard, K. H., Cutler, A., Hess, K. T., Gibson, J., & Lawler, J. J. (2007). Random Forests for Classification in Ecology. Ecology, 88(11), 2783–2792.
Franklin, J. (2010). Mapping species distributions: spatial inference and prediction. Cambridge University Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer.

Additional Reading

Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.
Bosch, S., Tyberghein, L., Deneudt, K., Hernandez, F., & De Clerck, O. (2018). In search of relevant predictors for marine species distribution modelling using the MarineSPEED benchmark dataset. Diversity and Distributions, 24(2), 144–157.
De Clercq, E. M., Leta, S., Estrada-Peña, A., Madder, M., Adehan, S., & Vanwambeke, S. O. (2015). Species distribution modelling for Rhipicephalus microplus (Acari: Ixodidae) in Benin, West Africa: Comparing datasets and modelling algorithms. Preventive Veterinary Medicine, 118(1), 8–21.
Ducci, L., Agnelli, P., Di Febbraro, M., Frate, L., Russo, D., Loy, A., Carranza, M. L., Santini, G., & Roscioni, F. (2015). Different bat guilds perceive their habitat in different ways: A multiscale landscape approach for variable selection in species distribution modelling. Landscape Ecology, 30(10), 2147–2159.
Greiser, C., Hylander, K., Meineri, E., Luoto, M., & Ehrlén, J. (2020). Climate limitation at the cold edge: Contrasting perspectives from species distribution modelling and a transplant experiment. Ecography, 43(5), 637–647.
Jarnevich, C. S., & Young, N. E. (2019). Not so Normal Normals: Species Distribution Model Results are Sensitive to Choice of Climate Normals and Model Type. Climate, 7(3), 37.
Mateo, R. G., Gastón, A., Aroca-Fernández, M. J., Saura, S., & García-Viñas, J. I. (2018). Optimization of forest sampling strategies for woody plant species distribution modelling at the landscape scale. Forest Ecology and Management, 410, 104–113.
Niamir, A., Skidmore, A. K., Muñoz, A.-R., Toxopeus, A. G., & Real, R. (2019). Incorporating knowledge uncertainty into species distribution modelling. Biodiversity and Conservation, 28(3), 571–588.
Phillips, N. D., Reid, N., Thys, T., Harrod, C., Payne, N. L., Morgan, C. A., White, H. J., Porter, S., & Houghton, J. D. R. (2017). Applying species distribution modelling to a data poor, pelagic fish complex: The ocean sunfishes. Journal of Biogeography, 44(10), 2176–2187.
Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.
Rodríguez-Rey, M., Consuegra, S., Börger, L., & Leaniz, C. G. de. (2019). Improving Species Distribution Modelling of freshwater invasive species for management applications. PLOS ONE, 14(6), e0217896.
Rose, P. M., Kennard, M. J., Moffatt, D. B., Sheldon, F., & Butler, G. L. (2016). Testing Three Species Distribution Modelling Strategies to Define Fish Assemblage Reference Conditions for Stream Bioassessment and Related Applications. PLOS ONE, 11(1), e0146728.
Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.