Introduction
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression method that builds multiple linear regression models across the range of predictor values. It does this by partitioning the data and fitting a separate linear regression model to each partition.
The MARS algorithm is an extension of linear models that makes no assumptions about the form of the relationship between the response variable and the predictor variables. Whereas Generalized Linear Models and Generalized Additive Models assume that the coefficients of the predictor variables are constant across all values of a predictor, the MARS algorithm explicitly allows for the fact that this is often not the case. The MARS algorithm also resembles machine learning models such as tree-based models, because it uses a similar iterative approach.
The MARS algorithm builds a model in two steps. First, it creates a collection of so-called basis functions (BF). In this procedure, the range of predictor values is partitioned into several groups. For each group, a separate linear regression is modeled, each with its own slope. The connections between the separate regression lines are called knots. The MARS algorithm automatically searches for the best spots to place the knots. Each knot has a pair of basis functions. These basis functions describe the relationship between the environmental variable and the response. The first basis function is ‘max(0, env var - knot)’, which means that it takes the maximum of two options: 0 or the result of ‘environmental variable value – value of the knot’. The second basis function has the opposite form: ‘max(0, knot - env var)’.
For example, if the value of the environmental variable at the knot is 11, then:
Basis function 1: for any value below 11, the outcome of ‘Env var – Knot’ will result in a negative number, which is smaller than 0 and thus the outcome of the basis function is 0. This means that the outcome of basis function 1 is 0 for all environmental values up to the knot, while for all values after the knot, the outcome of basis function 1 is the value of the environmental variable minus 11.
Basis function 2: this has the opposite form. Its outcome is 0 for all environmental values after the knot, and 11 minus the value of the environmental variable for all values before the knot.
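The pair of basis functions from this example can be written out as a short Python sketch. The function name `hinge_pair` and the three sample values are illustrative choices, not part of any MARS implementation; only the knot value of 11 comes from the example above.

```python
def hinge_pair(env_value, knot):
    """Return (BF1, BF2) = (max(0, env var - knot), max(0, knot - env var))."""
    x = float(env_value)
    bf1 = max(0.0, x - knot)  # zero up to the knot, then rises with the variable
    bf2 = max(0.0, knot - x)  # mirror image: positive before the knot, zero after
    return bf1, bf2


KNOT = 11  # knot value used in the worked example

for env_value in (8, 11, 14):  # one value below, at, and above the knot
    bf1, bf2 = hinge_pair(env_value, KNOT)
    print(f"env var = {env_value}: BF1 = {bf1}, BF2 = {bf2}")
# env var = 8: BF1 = 0.0, BF2 = 3.0
# env var = 11: BF1 = 0.0, BF2 = 0.0
# env var = 14: BF1 = 3.0, BF2 = 0.0
```

At any given value, at most one of the two basis functions is non-zero, which is what lets the fitted line have a different slope on each side of the knot.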
In the second step, MARS estimates a least-squares model with its basis functions as independent variables. It first fits a very large model, which is subsequently pruned (as in tree-based models) to avoid overfitting, by iteratively removing the basis functions that contribute the least to model fit.
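To make the least-squares step concrete, the following is a minimal sketch for a single predictor and a single knot, written in plain Python. It is a simplification and not the actual MARS implementation: the helper names, the synthetic data, and the brute-force search over candidate knots are all assumptions made for illustration, and the pruning pass is left out.

```python
def hinge_features(x, knot):
    """One design-matrix row: intercept plus the two hinge basis functions."""
    return [1.0, max(0.0, x - knot), max(0.0, knot - x)]


def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def fit_least_squares(xs, ys, knot):
    """Fit y ~ b0 + b1*BF1 + b2*BF2 via the normal equations; return (coef, rss)."""
    rows = [hinge_features(x, knot) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)
    rss = sum((y - sum(c * f for c, f in zip(coef, r))) ** 2
              for r, y in zip(rows, ys))
    return coef, rss


# Synthetic, noise-free data with a true knot at 11 (illustrative only).
xs = [float(v) for v in range(5, 18)]
ys = [2.0 + 0.5 * max(0.0, x - 11.0) + 1.5 * max(0.0, 11.0 - x) for x in xs]

# Try each interior data value as a knot and keep the one with the lowest
# residual sum of squares; the end points are skipped because one basis
# function would be zero everywhere there.
candidate_knots = xs[1:-1]
best_knot = min(candidate_knots, key=lambda k: fit_least_squares(xs, ys, k)[1])
coef, rss = fit_least_squares(xs, ys, best_knot)
print("best knot:", best_knot, "coefficients:", [round(c, 3) for c in coef])
```

On this synthetic data the search should recover the knot at 11, with a different slope on each side. A real MARS fit repeats this kind of search across all predictors, adds many basis-function pairs, and then prunes the resulting model.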
Advantages
Works well with a large number of predictor variables
Automatically detects interactions between variables
It is an efficient and fast algorithm, despite its complexity
Robust to outliers
Limitations
Susceptible to overfitting
More difficult to understand and interpret than other methods
Not good with missing data
Assumptions
No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another because this could cause problems with the estimation.
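One simple way to screen for highly correlated environmental variables before modelling is to compute pairwise Pearson correlations, as in the sketch below. The variable names, the sample values, and the 0.7 cut-off are assumptions chosen for illustration; they are not settings of Biosecurity Commons or of the MARS algorithm.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def highly_correlated_pairs(predictors, threshold=0.7):
    """List predictor pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:  # threshold is an illustrative rule of thumb
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical values of three environmental variables at six sites.
predictors = {
    "temperature": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "elevation": [900.0, 800.0, 690.0, 610.0, 500.0, 400.0],
    "rainfall": [300.0, 120.0, 410.0, 150.0, 380.0, 200.0],
}

flagged = highly_correlated_pairs(predictors)
print(flagged)
```

In this made-up data set only temperature and elevation are flagged, so one of the two would be a candidate for removal before fitting the model.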
Requires absence data
Yes.
Configuration options
Biosecurity Commons allows the user to set model arguments as specified below.
random_seed | Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation. |
Number of repetitions (nb_run_eval) | Integer value, corresponding to the number of repetitions to be done for calibration/validation splitting. (default = 10) |
Data split percentage (data_split) | Numeric value between 0 and 100, corresponding to the percentage of data used to calibrate the models (calibration/validation splitting). (default = 100) |
prevalence | Gives more or less weight to presences or absences. If NULL (default), each observation (presence or absence) has the same weight; if value < 0.5, absences are given more weight; if value > 0.5, presences are given more weight. (algorithm parameter) |
Variable importance (var_import) | Integer value, corresponding to the number of permutations to be done for each variable to estimate variable importance. (default = 0) |
Scale models (rescale_all_models) | A logical value defining whether all model predictions should be scaled with a binomial GLM or not. (default = FALSE) |
Evaluate all models (do_full_models) | A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not. (default = TRUE) |
Regression type (type) | Type of regression to model: linear ("simple"), quadratic or polynomial. (default = simple) |
Variable interaction (interaction_level) | Number of interactions between predictor variables that need to be considered. (default = 0) |
Maximum term (nk) | Maximum number of terms in the model before pruning. (default = NULL) |
Penalty (penalty) | Generalized cross validation (GCV) penalty per knot; default = 2 if interaction level = 1, or 3 if interaction level > 1. (default = 2) |
Threshold (thres) | Forward stepwise stopping threshold; the forward pass terminates if adding a term changes RSq by less than the threshold. (default = 0.001) |
Maximum number pruned (nprune) | Maximum number of terms in the pruned model. (default = NULL) |
Pruning method (pmethod) | Pruning method. (default = 'backward') |
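To show how the penalty option affects pruning, the sketch below computes the generalized cross validation (GCV) score in the form commonly given for MARS, where the effective number of parameters is the number of terms plus the penalty times half the number of non-intercept terms. This formulation is stated here as an assumption; the function name and the example numbers are illustrative and are not taken from Biosecurity Commons.

```python
def gcv(rss, n_obs, n_terms, penalty=2.0):
    """GCV = RSS / (N * (1 - enp / N)^2), with enp = n_terms + penalty * (n_terms - 1) / 2."""
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    if effective_params >= n_obs:
        return float("inf")  # model too complex for the amount of data
    return rss / (n_obs * (1.0 - effective_params / n_obs) ** 2)


# Two hypothetical candidate models fitted to 100 observations:
# the larger one fits slightly better (lower RSS) but uses more terms.
small_model = gcv(rss=50.0, n_obs=100, n_terms=5)
large_model = gcv(rss=48.0, n_obs=100, n_terms=9)
print(f"GCV with 5 terms: {small_model:.4f}")
print(f"GCV with 9 terms: {large_model:.4f}")
```

With these made-up numbers the smaller model has the lower GCV, so it would be preferred during pruning. Raising the penalty makes each additional term more costly and therefore favours simpler models.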
References
- Elith, J., Graham, C. H., Anderson, R. P., Dudík, M., Ferrier, S., Guisan, A., Hijmans, R. J., Huettmann, F., Leathwick, J. R., Lehmann, A., Li, J., Lohmann, L. G., Loiselle, B. A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J. McC. M., Peterson, A. T., … Zimmermann, N. E. (2006). Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2), 129–151.
- Franklin, J. (2010). Mapping species distributions: spatial inference and prediction. Cambridge University Press.
- Friedman, J. H. (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19(1), 1–67.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction (2nd ed.). Springer.
- Leathwick, J. R., Elith, J., & Hastie, T. (2006). Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling, 199(2), 188–196.
- Milborrow, S. (2015). Notes on the earth package. http://www.milbo.org/doc/earth-notes.pdf
Additional Reading
- Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.
- Choe, H., Thorne, J. H., & Seo, C. (2016). Mapping National Plant Biodiversity Patterns in South Korea with the MARS Species Distribution Model. PLOS ONE, 11(3), e0149511.
- Ducci, L., Agnelli, P., Di Febbraro, M., Frate, L., Russo, D., Loy, A., Carranza, M. L., Santini, G., & Roscioni, F. (2015). Different bat guilds perceive their habitat in different ways: A multiscale landscape approach for variable selection in species distribution modelling. Landscape Ecology, 30(10), 2147–2159.
- Iturbide, M., Bedia, J., Herrera, S., del Hierro, O., Pinto, M., & Gutiérrez, J. M. (2015). A framework for species distribution modelling with improved pseudo-absence generation. Ecological Modelling, 312, 166–174.
- Jarnevich, C. S., & Young, N. E. (2019). Not so Normal Normals: Species Distribution Model Results are Sensitive to Choice of Climate Normals and Model Type. Climate, 7(3), 37.
- Phillips, N. D., Reid, N., Thys, T., Harrod, C., Payne, N. L., Morgan, C. A., White, H. J., Porter, S., & Houghton, J. D. R. (2017). Applying species distribution modelling to a data poor, pelagic fish complex: The ocean sunfishes. Journal of Biogeography, 44(10), 2176–2187.
- Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.
- Rose, P. M., Kennard, M. J., Moffatt, D. B., Sheldon, F., & Butler, G. L. (2016). Testing Three Species Distribution Modelling Strategies to Define Fish Assemblage Reference Conditions for Stream Bioassessment and Related Applications. PLOS ONE, 11(1), e0146728.
- Vacchiano, G., & Motta, R. (2015). An improved species distribution model for Scots pine and downy oak under future climate change in the NW Italian Alps. Annals of Forest Science, 72(3), 321–334.
- Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.