Flexible Discriminant Analysis SDM explained

Modified on Thu, 23 Feb, 2023 at 1:12 PM

Introduction

Flexible discriminant analysis (FDA) is a general methodology for multigroup non-linear classification. It is a classification model that combines non-parametric regression models, e.g. multivariate adaptive regression splines (MARS), with linear discriminant analysis (LDA).

The first step of an FDA is a non-parametric regression, which uses optimal scoring to transform the response variable so that the data are in a better form for linear separation. It builds multiple regression models, so-called basis functions (BF), across the range of predictor values. In this procedure, the range of predictor values is partitioned into several groups (categories).
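
As a rough illustration of what basis functions look like when MARS is the regression method, the sketch below expands a single predictor value into mirrored hinge functions around a set of knots. The knot positions, the temperature predictor, and the function names `hinge` and `basis_expand` are hypothetical; a real MARS fit selects knots from the data and is not reproduced here.

```python
def hinge(x, knot, direction=+1):
    """MARS-style hinge: max(0, x - knot) for direction=+1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def basis_expand(x, knots):
    """Expand one predictor value into a pair of mirrored hinges per knot."""
    features = []
    for k in knots:
        features.append(hinge(x, k, +1))  # active above the knot
        features.append(hinge(x, k, -1))  # active below the knot
    return features


# Hypothetical temperature predictor, with assumed knots at 10 and 20.
knots = [10.0, 20.0]
for temp in (5.0, 15.0, 25.0):
    print(temp, basis_expand(temp, knots))
```

Each knot splits the predictor range into a part below and a part above it, which is one way to picture the partitioning described above.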

In the second step of an FDA, the groups identified in the first step are used to run a linear discriminant analysis. Linear discriminant analysis focuses on maximising the separability among groups, while minimising the variance within each group.

The first axis that LDA creates (environmental predictor 1) accounts for the most variation between the groups. The second axis (environmental predictor 2) accounts for the second most variation between the groups. This continues until every predictor is ranked. For simplicity, only a 2-dimensional graph with 2 predictors (axes) is displayed at a time.
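
The idea of maximising separation between groups relative to the variance within them can be sketched with a two-group, two-predictor Fisher discriminant. This is a minimal illustration under stated assumptions, not the implementation used by Biosecurity Commons: the site data are invented, and the helper names (`mean_vec`, `scatter`, `lda_direction`, `project`) are hypothetical.

```python
def mean_vec(rows):
    """Mean of each of the two predictors across a group."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 within-group scatter matrix around the group mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def lda_direction(group_a, group_b):
    """Fisher direction w = Sw^-1 (mean_b - mean_a) for two groups."""
    ma, mb = mean_vec(group_a), mean_vec(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(row, w):
    """Score of one site along the discriminant axis."""
    return row[0] * w[0] + row[1] * w[1]


# Invented (temperature, rainfall) values at absence and presence sites.
absences = [(10.0, 200.0), (12.0, 220.0), (11.0, 180.0), (13.0, 210.0)]
presences = [(18.0, 390.0), (20.0, 430.0), (19.0, 380.0), (21.0, 400.0)]

w = lda_direction(absences, presences)
scores_abs = [project(r, w) for r in absences]
scores_pre = [project(r, w) for r in presences]
print("groups separated:", max(scores_abs) < min(scores_pre))
```

Projecting every site onto the single direction `w` reduces the two predictors to one score per site, on which the two groups no longer overlap for this toy dataset.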

Advantages

Works well with a large number of predictor variables
Automatically detects interactions between variables
It is an efficient and fast algorithm, despite its complexity
Robust to outliers

Limitations

Strong sensitivity to configuration settings
Susceptible to overfitting
More difficult to understand and interpret than other methods
The response variable or grouping variable can be categorical, but independent variables are continuous, assumed to be normal.

Assumptions

No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another because this could cause problems with the estimation.

Requires absence data

Yes.

Configuration options

Biosecurity Commons allows the user to set model arguments as specified below.

random_seed	Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation.
nb_run_eval	The original dataset is split in two, one part to calibrate and the other to evaluate the model. You can repeat this process ‘N’ times. This is called n-fold cross-validation. (default = 10)
data_split	The percentage of data used for model calibration. Allows robust tests when independent data are not available. (default = 100)
prevalence	Gives more or less weight to particular observations. If this option is left as NULL (default), each observation (presence or absence) has the same weight, independent of the number of presences and absences. A value below 0.5 gives more weight to absences, whereas a value above 0.5 gives more weight to presences. However, when pseudo-absence data have been generated, prevalence is 0.5 by default, as pseudo-absences should not be given more weight than presences. For example, the model will not run if prevalence is set to 0.7 while pseudo-absences are in use.
var_import	Number of permutations to estimate the importance of each variable. If this value is larger than 0, the algorithm will produce an object called ‘variabImprortance.Full.csv’, in which high values mean that the predictor variable has a high importance, whereas a value close to 0 corresponds to no importance. (default = 0)
rescale_all_models	If true, all model predictions will be scaled with a binomial GLM, i.e. to values between 0 and 1. For ‘FDA’ and ‘ANN’, categorical models need to be scaled. In this case, it is recommended to scale all computed models to ensure comparable projections. However, it is not advised in other cases, as it reduces the projection scale amplitude. (default = FALSE)
do_full_models	Calibrate & evaluate models with the whole dataset (default = TRUE)
method	The regression method used in optimal scaling. The default is Multivariate Adaptive Regression Splines (default = MARS)
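
To make the interplay of random_seed, nb_run_eval, data_split and prevalence more concrete, the sketch below imitates them with the Python standard library. It is an illustration under assumptions only: the functions `repeated_split` and `prevalence_weights` are hypothetical, the values 70 and 3 are chosen for the example rather than being the defaults listed above, and the weighting rule (presences carry the stated share of the total weight) is an assumed reading of the prevalence option, not the platform's actual code.

```python
import random


def repeated_split(n_records, data_split=70, nb_run_eval=3, random_seed=123):
    """Return nb_run_eval (calibration, evaluation) index pairs.

    data_split is the percentage of records used for calibration.
    """
    rng = random.Random(random_seed)  # same seed -> same sequence of splits
    n_calib = round(n_records * data_split / 100)
    splits = []
    for _ in range(nb_run_eval):
        idx = list(range(n_records))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_calib]), sorted(idx[n_calib:])))
    return splits


def prevalence_weights(labels, prevalence=0.5):
    """Weights such that presences (label 1) sum to `prevalence` of the total."""
    n_pres = sum(labels)
    n_abs = len(labels) - n_pres
    w_pres = prevalence / n_pres
    w_abs = (1 - prevalence) / n_abs
    return [w_pres if y == 1 else w_abs for y in labels]


# Ten hypothetical records: the same seed reproduces the same splits.
run_a = repeated_split(10, data_split=70, nb_run_eval=3, random_seed=123)
run_b = repeated_split(10, data_split=70, nb_run_eval=3, random_seed=123)
print("reproducible:", run_a == run_b)

# Two presences and six absences, weighted so each class carries half the weight.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
weights = prevalence_weights(labels, prevalence=0.5)
print(weights)
```

With prevalence at 0.5, each of the two presences receives a larger individual weight than each of the six absences, so both classes contribute equally in total.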

References

Hallgren, W., Santana, F., Low-Choy, S., Zhao, Y., & Mackey, B. (2019). Species distribution models can be highly sensitive to algorithm configuration. Ecological Modelling, 408, 108719.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer.
Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible Discriminant Analysis by Optimal Scoring. Journal of the American Statistical Association, 89(428), 1255–1270.
Reynès, C., Sabatier, R., & Molinari, N. (2006). Choice of B-splines with free parameters in the flexible discriminant analysis context. Computational Statistics & Data Analysis, 51(3), 1765–1778.
Thuiller, W., Georges, D., Gueguen, M., Engler, R., & Breiner, F. (2021). biomod2: Ensemble Platform for Species Distribution Modeling (3.5.1). https://CRAN.R-project.org/package=biomod2

Additional Reading

Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.
Phillips, N. D., Reid, N., Thys, T., Harrod, C., Payne, N. L., Morgan, C. A., White, H. J., Porter, S., & Houghton, J. D. R. (2017). Applying species distribution modelling to a data poor, pelagic fish complex: The ocean sunfishes. Journal of Biogeography, 44(10), 2176–2187.
Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.
Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.