Introduction
Random Forests extend single Classification Trees by building many decision trees, each on a random subset of the data. Every random subset has the same number of data points and is drawn from the complete dataset with replacement: a selected data point is placed back in the full dataset and can be selected again, both within the same subset and for subsequent trees. This procedure is called ‘bagging’ (bootstrap aggregating), and each data point has an equal probability of being selected for each new random subset. On average, about two thirds (roughly 63%) of the distinct data points appear in each random subset. The remaining third is not used to build that tree and is called the ‘out-of-bag’ data. This part is later used to evaluate the model.
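The bagging procedure described above can be illustrated with a minimal Python sketch. It is not the code Biosecurity Commons runs; the function name `bootstrap_sample` and the dataset size are hypothetical, chosen only to show why roughly two thirds of the data points end up in each subset while the rest are left out-of-bag.

```python
import random


def bootstrap_sample(n_points, rng):
    """Draw n_points indices with replacement.

    Returns the in-bag indices (duplicates possible) and the sorted
    out-of-bag indices that were never drawn.
    """
    in_bag = [rng.randrange(n_points) for _ in range(n_points)]
    out_of_bag = sorted(set(range(n_points)) - set(in_bag))
    return in_bag, out_of_bag


# Hypothetical dataset of 10,000 points; the seed makes the draw repeatable.
rng = random.Random(123)
n_points = 10_000
in_bag, out_of_bag = bootstrap_sample(n_points, rng)

unique_fraction = len(set(in_bag)) / n_points
oob_fraction = len(out_of_bag) / n_points
print(f"distinct points in the subset: {unique_fraction:.3f}")
print(f"out-of-bag points:             {oob_fraction:.3f}")
```

For large datasets the first fraction approaches 1 - 1/e (about 0.632), which is where the "about two thirds" figure comes from.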
Advantages
- One of the most accurate learning algorithms available
- Can handle many predictor variables
- Provides estimates of the importance of different predictor variables
- Maintains accuracy even when a large proportion of the data is missing
Limitations
- Can overfit datasets that are particularly noisy
- For data that include categorical predictor variables with different numbers of levels, random forests are biased in favor of the predictors with more levels. The variable importance scores from random forests are therefore not always reliable for this type of data
Assumptions
No formal distributional assumptions. Random forests are non-parametric and can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.
Requires absence data
Yes.
Configuration options
Biosecurity Commons allows the user to set model arguments as specified below.
Argument | Default value | Argument description |
Random seed | NULL | Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation. |
Number of repetitions (nb_run_eval) | 10 | Integer value, corresponding to the number of repetitions to be done for calibration/validation splitting. |
Data split percentage (data_split) | 100 | Numeric value between 0 and 100, corresponding to the percentage of data used to calibrate the models (calibration/validation splitting). |
Weighted response weights (Prevalence) | NULL | Allows the user to give more or less weight to particular observations. If NULL, each observation (presence or absence) has the same weight. If value < 0.5: absences are given more weight; if value > 0.5: presences are given more weight. |
Variable importance (var_import) | 0 | Integer value, corresponding to the number of permutations to be done for each variable to estimate variable importance. |
Scale models (rescale_all_models) | FALSE | A logical value defining whether all model predictions should be scaled with a binomial GLM or not. |
Evaluate all models (do_full_models) | TRUE | A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not. |
Do classification (do.classif) | TRUE | If TRUE, a classification random forest is computed; otherwise a regression random forest is computed. |
Number of trees (ntree) | 500 | Total number of trees to grow. |
Select variables (mtry) | 'default' | Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3). |
Minimum node size (nodesize) | 1 | The minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). |
Maximum node size (maxnodes) | NULL | The maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than the maximum possible, a warning is issued. |
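Two of the settings above lend themselves to a short illustration: the 'default' value of mtry, and the interaction between the random seed, the number of repetitions and the data split percentage. The Python sketch below is illustrative only and is not the platform's implementation. The helper names `default_mtry` and `repeated_splits` are hypothetical, rounding the mtry defaults down to a whole number (with a minimum of 1) is an assumption, and the plain shuffled split is an assumption that ignores any stratification by presence and absence that the platform may apply.

```python
import math
import random


def default_mtry(n_predictors, classification=True):
    """Default number of candidate variables per split.

    sqrt(p) for classification and p/3 for regression, rounded down,
    never less than 1 (the rounding is an assumption of this sketch).
    """
    if classification:
        return max(1, math.isqrt(n_predictors))
    return max(1, n_predictors // 3)


def repeated_splits(n_points, data_split=70, nb_run_eval=10, seed=123):
    """Build nb_run_eval calibration/validation splits.

    data_split is the percentage of points used for calibration.
    A fixed seed makes every repetition reproducible.
    """
    rng = random.Random(seed)
    n_calibration = int(n_points * data_split / 100)
    splits = []
    for _ in range(nb_run_eval):
        indices = list(range(n_points))
        rng.shuffle(indices)
        calibration = sorted(indices[:n_calibration])
        validation = sorted(indices[n_calibration:])
        splits.append((calibration, validation))
    return splits


# Hypothetical example: 200 observations, 70% calibration, 3 repetitions.
splits = repeated_splits(200, data_split=70, nb_run_eval=3, seed=123)
same_again = repeated_splits(200, data_split=70, nb_run_eval=3, seed=123)
print(splits == same_again)      # same seed, same splits
print(default_mtry(16))          # classification with 16 predictors
print(default_mtry(10, False))   # regression with 10 predictors
```

With a data split of 100, the validation part of every split is empty, so evaluation has to rely on the data the model was calibrated on.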
References
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
- Cutler, D. R., Edwards Jr., T. C., Beard, K. H., Cutler, A., Hess, K. T., Gibson, J., & Lawler, J. J. (2007). Random Forests for Classification in Ecology. Ecology, 88(11), 2783–2792.
- Franklin, J. (2010). Mapping species distributions: spatial inference and prediction. Cambridge University Press.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer.
Additional Reading
- Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.
- Bosch, S., Tyberghein, L., Deneudt, K., Hernandez, F., & De Clerck, O. (2018). In search of relevant predictors for marine species distribution modelling using the MarineSPEED benchmark dataset. Diversity and Distributions, 24(2), 144–157.
- De Clercq, E. M., Leta, S., Estrada-Peña, A., Madder, M., Adehan, S., & Vanwambeke, S. O. (2015). Species distribution modelling for Rhipicephalus microplus (Acari: Ixodidae) in Benin, West Africa: Comparing datasets and modelling algorithms. Preventive Veterinary Medicine, 118(1), 8–21.
- Ducci, L., Agnelli, P., Di Febbraro, M., Frate, L., Russo, D., Loy, A., Carranza, M. L., Santini, G., & Roscioni, F. (2015). Different bat guilds perceive their habitat in different ways: A multiscale landscape approach for variable selection in species distribution modelling. Landscape Ecology, 30(10), 2147–2159.
- Greiser, C., Hylander, K., Meineri, E., Luoto, M., & Ehrlén, J. (2020). Climate limitation at the cold edge: Contrasting perspectives from species distribution modelling and a transplant experiment. Ecography, 43(5), 637–647.
- Jarnevich, C. S., & Young, N. E. (2019). Not so Normal Normals: Species Distribution Model Results are Sensitive to Choice of Climate Normals and Model Type. Climate, 7(3), 37.
- Mateo, R. G., Gastón, A., Aroca-Fernández, M. J., Saura, S., & García-Viñas, J. I. (2018). Optimization of forest sampling strategies for woody plant species distribution modelling at the landscape scale. Forest Ecology and Management, 410, 104–113.
- Niamir, A., Skidmore, A. K., Muñoz, A.-R., Toxopeus, A. G., & Real, R. (2019). Incorporating knowledge uncertainty into species distribution modelling. Biodiversity and Conservation, 28(3), 571–588.
- Phillips, N. D., Reid, N., Thys, T., Harrod, C., Payne, N. L., Morgan, C. A., White, H. J., Porter, S., & Houghton, J. D. R. (2017). Applying species distribution modelling to a data poor, pelagic fish complex: The ocean sunfishes. Journal of Biogeography, 44(10), 2176–2187.
- Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.
- Rodríguez-Rey, M., Consuegra, S., Börger, L., & Leaniz, C. G. de. (2019). Improving Species Distribution Modelling of freshwater invasive species for management applications. PLOS ONE, 14(6), e0217896.
- Rose, P. M., Kennard, M. J., Moffatt, D. B., Sheldon, F., & Butler, G. L. (2016). Testing Three Species Distribution Modelling Strategies to Define Fish Assemblage Reference Conditions for Stream Bioassessment and Related Applications. PLOS ONE, 11(1), e0146728.
- Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.