Introduction
Generalized Linear Models (GLMs) are an extension of ‘simple’ linear regression models, which predict the response variable as a function of one or more predictor variables. Linear regression models rest on a few assumptions, such as the assumption that a straight line can describe the relationship between the response and the predictor variables. This implies that a constant change in a predictor leads to a constant change in the response variable. Linear regression also assumes normally distributed errors. These assumptions are often violated in ecological data; therefore, these models are extended into GLMs to deal with response data that are not normally distributed.
GLMs find the equation that best predicts a species’ occurrence from the values of the environmental variables. The model has three critical parts:
- The probability distribution of the response variable.
- The linear predictor (LP), a combination of all predictor variables representing an overall score for the environmental suitability.
- The link function, describing how the mean of the response relates to the linear predictor.
Thus, the relationship between the response and the predictors does not need to be linear: the link function transforms the mean of the response so that the transformed mean is linearly related to the predictors.
A GLM with binomial data, such as the presence/absence of a species, is commonly called “logistic regression”. In this case, the link function is the logit function, which is the log of the odds (probability of presence/probability of absence) (Figure 1).
Figure 1. Plot showing how the relationship between a binary response variable and predictors can be made linear through transformation.
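The three parts of a logistic-regression GLM can be sketched in a few lines of Python. This is a minimal illustration only: the intercept, the coefficient and the temperature values are hypothetical and are not taken from any fitted model.

```python
import math

def logit(p):
    """Link function: the log of the odds for a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: map a linear-predictor value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, chosen only for illustration.
intercept = -4.0
coef_temperature = 0.3

def presence_probability(temperature):
    # Linear predictor: a weighted sum of the predictor values.
    linear_predictor = intercept + coef_temperature * temperature
    # The inverse link turns the linear predictor into P(presence).
    return inverse_logit(linear_predictor)

for temperature in (5, 10, 15, 20, 25):
    p = presence_probability(temperature)
    # logit(p) recovers the linear predictor, which is linear in temperature.
    print(f"temperature={temperature:>2}  logit={logit(p):6.2f}  P(presence)={p:.3f}")
```

The probabilities follow an S-shaped curve, while the logit column changes by the same amount for every equal step in temperature.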
The coefficient of a predictor variable (the number used to multiply a variable) in a logistic regression model can be easily interpreted, as in the following hypothetical example. Suppose a predictor, such as average annual temperature, has a positive coefficient of 0.3 in an estimated model of the species occurrence. The coefficient is the change in the log-odds of presence for a one-unit increase in the predictor, so a one-unit increase in temperature multiplies the odds of species presence by exp(0.3) = 1.35 (the odds ratio), which is a 35% increase in the odds. The corresponding change in the probability of presence depends on the starting probability.
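The arithmetic for the hypothetical coefficient of 0.3 can be checked with a short sketch. The starting probabilities below are assumptions chosen for illustration; the point is that the factor exp(0.3) applies to the odds, so the resulting change in probability differs with the starting value.

```python
import math

coef = 0.3                    # coefficient from the hypothetical example
odds_ratio = math.exp(coef)   # factor applied to the odds per one-unit increase

def probability_after_one_unit(p_before):
    """Probability of presence after a one-unit increase in the predictor."""
    odds_before = p_before / (1.0 - p_before)
    odds_after = odds_before * odds_ratio
    return odds_after / (1.0 + odds_after)

print(f"odds ratio = {odds_ratio:.2f}")
for p_before in (0.05, 0.50, 0.90):  # illustrative starting probabilities
    p_after = probability_after_one_unit(p_before)
    print(f"P(presence): {p_before:.2f} -> {p_after:.3f}")
```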
The values of the variable coefficients are estimated by maximum likelihood estimation (MLE), which maximizes the "agreement" of the predicted species occurrences with the observed data. In other words, MLE finds the coefficient values under which the observed data would be most likely. Most GLM implementations, including the GLM provided in EcoCommons, use the iteratively reweighted least squares (IWLS) method for MLE.
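To make the IWLS idea concrete, here is a minimal sketch for a logistic regression with an intercept and a single predictor. It is not the platform's implementation: the function name, the toy data and the convergence rule (relative change in deviance) are assumptions, and the max_iterations and epsilon defaults simply mirror the control_maxit and control_epsilon values listed under Configuration options.

```python
import math

def fit_logistic_iwls(x, y, max_iterations=25, epsilon=1e-8):
    """Fit P(y = 1) = inverse_logit(b0 + b1 * x) by IWLS (illustrative sketch)."""
    b0, b1 = 0.0, 0.0
    previous_deviance = float("inf")
    for _ in range(max_iterations):
        # Sums that make up the weighted normal equations for (b0, b1).
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        deviance = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                    # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))     # fitted probability
            w = mu * (1.0 - mu)                   # IWLS weight
            z = eta + (yi - mu) / w               # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
            deviance += -2.0 * (yi * math.log(mu) + (1 - yi) * math.log(1.0 - mu))
        # Solve the 2x2 weighted least-squares system for the new coefficients.
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
        # Stop once the deviance has stopped changing between iterations.
        if abs(deviance - previous_deviance) / (abs(deviance) + 0.1) < epsilon:
            break
        previous_deviance = deviance
    return b0, b1

# Toy presence/absence data (hypothetical): x is a predictor, y is 0/1.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic_iwls(x, y)
print(f"intercept={b0:.3f}  slope={b1:.3f}")
```

Each pass fits an ordinary weighted least-squares regression to a "working response", then recomputes the weights from the new fit. The toy data are deliberately not perfectly separable, because the maximum likelihood estimates do not exist for perfectly separated presences and absences.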
Advantages
- The response variable can follow any distribution from the exponential family (for example normal, binomial or Poisson)
- Able to deal with categorical predictors
- Relatively easy to interpret, giving a clear understanding of how each predictor influences the outcome
- Less susceptible to overfitting than, for example, CTA or MARS algorithms
Limitations
- Needs relatively large datasets. The more predictor variables, the larger the sample size (N) required. As a rule of thumb, the number of predictor variables should be less than N/10.
- Sensitive to outliers
Assumptions
No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another, because strong correlation (collinearity) makes the coefficient estimates unstable and difficult to interpret.
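One simple way to screen predictors before fitting is to compute pairwise Pearson correlations and flag pairs above a cut-off. The sketch below is an assumption-laden illustration: the variable names and values are made up, and the 0.7 threshold is a commonly used convention rather than a setting of this platform.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def highly_correlated_pairs(predictors, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical predictor values at six sites.
predictors = {
    "annual_mean_temp": [12.1, 14.3, 15.0, 17.8, 19.2, 21.5],
    "max_temp_warmest_month": [24.0, 26.1, 27.5, 30.2, 31.0, 34.1],
    "annual_precip": [820, 640, 910, 500, 760, 580],
}

for name_a, name_b, r in highly_correlated_pairs(predictors):
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

When a pair is flagged, a common choice is to keep only the variable that is more ecologically meaningful for the species.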
Requires absence data
Yes.
Configuration options
Biosecurity Commons allows the user to set model arguments as specified below.
random_seed | Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic steps such as machine learning or cross-validation. |
Number of repetitions (nb_run_eval) | Integer value, corresponding to the number of repetitions to be done for calibration/validation splitting. (default = 10) |
Data split percentage (data_split) | Numeric value between 0 and 100, corresponding to the percentage of data used to calibrate the models (calibration/validation splitting). (default = 100) |
prevalence | Allows more or less weight to be given to particular observations; default = NULL: each observation (presence or absence) has the same weight; if value < 0.5: absences are given more weight; if value > 0.5: presences are given more weight. (algorithm parameter) |
Variable importance (var_import) | Integer value, corresponding to the number of permutations to be done for each variable to estimate variable importance. (default = 0) |
Scale models (rescale_all_models) | A logical value defining whether all model predictions should be scaled with a binomial GLM or not. (default = FALSE) |
Evaluate all models (do_full_models) | A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not. (default = TRUE) |
Regression type (type) | Type of regression to model: linear ("simple"), quadratic or polynomial. (default = quadratic) |
Variable interaction (interaction_level) | Number of interactions between predictor variables that need to be considered. (default = 0) |
Test fit model (test) | Criterion used to test the fit of the model in stepwise predictor selection; if 'none', the stepwise procedure is switched off. (default = AIC) |
Family (family) | Description of the error distribution of the response variable and the link function used in the model. (default = binomial) |
Mean start value (mustart) | Starting values for the vector of means. (default = 0.5) |
Tolerance (control_epsilon) | Positive convergence tolerance. (default = 1e-8) |
Maximum iterations (control_maxit) | The maximum number of IWLS iterations used to find maximum likelihood estimates. (default = 25) |
Iteration output (control_trace) | Whether output should be produced for each IWLS iteration. (default = FALSE) |