Potential species distribution modeling and the use of principal component analysis as predictor variables

Lauro LÓpez-mata

Availability and improved access to high-resolution digital terrain models (DTM) enables new approaches for the analysis of spatially explicit biological data. In this study, the spatial distribution of 16 tree species in a tropical mountain rain forest in South Ecuador and its relationship with topographic variables was evaluated at a fine-scale ecological level using two presence-only species distribution modelling techniques: The maximum entropy model (Maxent) and the ecological niche factor analysis (ENFA). Spatially explicit tree data stem from long-term forest monitoring plots in three microcatch-ments with a total area of 11.1 ha. Topographic variables were derived from a high-resolution DTM. Model performance was assessed by the true skill statistic (TSS) and area under curve (AUC) of the receiver operator characteristic (ROC), using both a k-fold approach and null-models. Performance varied among species and techniques, but generally Maxent models showed better performance than ENFA models. Furthermore, the ecological plausibility of the models was confirmed by comparing them with a previously established forest type classification. Among the explanatory topographic variables, elevation and a Topographic Position Index (TPI) appear as the main determinants for the distribution of most of the tree species. This study demonstrates that even on a small scale, the use of presence-only species distribution modelling techniques is a viable option for modelling suitable habitat for tree species in tropical mountain rain forests, indicating suitability for supporting stand-level planning and site-species matching techniques for natural forest management. Zusammenfassung: Die zunehmende Verfügbarkeit von hochauflösenden digitalen Geländemodellen ermöglicht neue Ansätze zur Analyse von räumlich expliziten biologischen Daten. In dieser Studie haben wir die räumliche Verteilung von 16 Baumarten und deren Zusammenhang mit topographischen Variablen auf einer feinskaligen ökologischen Ebene in einem tropischen Bergregenwald in Südecuador evaluiert. Dafür haben wir zwei auf Präsenzdaten basierende Techniken zur Ha-bitatmodellierung angewandt: Die Maximum-Entropie Methode (Maxent) und " Ecological Niche Factor Analysis " (ENFA). Räumlich explizite Baumdaten stammen aus Langzeitbeobachtungsflächen von drei kleinen Wassereinzugsgebieten mit ei-ner Fläche von 11.1 ha. Topographische Variablen wurden aus einem hochauflösenden digitalen Geländemodel abgeleitet. Zur Bewertung der Modelle wurden die True Skill Statistic (TSS) und die Fläche unterhalb der Kurve der Receiver Operating Characteristic (ROC) für sowohl k-fache Kreuzvalidierungen als auch Null-Modelle berechnet. Die Güte der Modelle hat zwischen den beiden Techniken und verschiedenen Arten variiert, aber generell schnitten Maxent-Modelle besser als ENFA-Modelle ab. Des Weiteren haben wir die ökologische Plausibilität der Modelle überprüft, indem wir sie mit einer bereits be-stehenden Waldtypenklassifizierung verglichen haben. Unter den erklärenden topographischen Variablen scheinen die Höhe über dem Meeresspiegel und ein Topographischer Positions Index (TPI) die bestimmenden Faktoren für die Verteilung der meisten Baumarten zu sein. Diese Studie belegt, dass selbst auf kleinen Skalen auf Präsenzdaten basierende Techniken zur Habitatmodellierung angewandt werden können, um geeignete Habitate für Baumarten in tropischen Bergregenwäldern abzubilden. Dies deutet darauf hin, dass diese Techniken zur Unterstützung von Planungen auf Bestandesebene und zur Auswahl von geeigneten Standorten für Baumarten im Rahmen der Bewirtschaftung von Naturwäldern geeignet sind.

Ordinary least-square (OLS) regression is fundamental to quantitative research in many ecological disciplines. However, spatially explicit methods have recently been proposed that allow the incorporation of spatial autocorrelation into ecological models. We compared the spatial error simultaneous autoregressive model (SARerr) and generalized least squares regression (GLS) with the results of simple and multiple OLS regressions, to analyze the relationship between white-tailed deer (Odocoileus virginianus) population density and environmental conditions in two regions dominated by tropical dry forests in central Mexico. The spatially explicit methods presented better goodness of fit than the OLS regression; we also observed a miscalculation in the probabilities obtained with the OLS regression, which in this method led to an incorrect interpretation. In general, we suggest the application of spatially explicit methods to analyze species-habitat relationships when SAC is observed in model residuals. We also discuss the management implications of these results. Resumen En los últimos años se han propuesto diversos métodos conocidos como espacialmente explícitos que permiten incorporar la autocorrelación espacial en los modelos. En este estudio, aplicamos modelos autorregresivos simultáneos (SARerr) y generalización de mínimos cuadrados (GLS) para analizar la relación entre la densidad poblacional del venado cola blanca Odocoileus virginianus y un conjunto de variables de hábitat en dos regiones dominadas por bosque tropical seco en el centro de México. Contrastamos los resultados con los obtenidos con la regresión simple y múltiple por cuadrados mínimos ordinarios (OLS), que es el método comúnmente usado en este tipo de estudios. Los métodos espacialmente explícitos tuvieron un mejor desempeño que la regresión OLS en términos de bondad de ajuste. Por otro lado, observamos un cálculo incorrecto en las probabilidades obtenidas con la regresión OLS lo que en este método nos llevó a una mala interpretación de cuáles son las variables asociadas con la densidad de esta especie. Recomendamos, de manera general la aplicación de métodos espacialmente explícitos para analizar las relaciones especie-hábitat cuando se observe autocorrelación espacial en los residuales de los modelos. Discutimos las implicaciones de manejo. Palabras clave: relación especie-hábitat, modelos espacialmente explícitos, modelo autorregresivo.

Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 189 Potential species distribution modeling and the use of principal component analysis as predictor variables Modelado de la distribución potencial de especies y el uso del análisis de componentes principales como variables predictoras Gustavo Cruz-Cárdenas1, Lauro López-Mata2 , José Luis Villaseñor3 and Enrique Ortiz3 1 Centro Interdisciplinario de Investigación para el Desarrollo Integral Regional-Instituto Politécnico Nacional-Michoacán, COFAA. Justo Sierra 28, 59510 Jiquilpan, Michoacán, Mexico. 2 Posgrado en Botánica, Colegio de Postgraduados. Km. 36.5 Carretera Federal México-Texcoco. Montecillo, 56230 Texcoco, Estado de México, Mexico. 3 Departamento de Botánica, Instituto de Biología, Universidad Nacional Autónoma de México. Apartado postal 70-367, 04510 México, D. F., Mexico. laurolopezmata@gmail.com; lauro@colpos.mx Abstract. Prior to modeling the potential distribution of a species it is recommended to carry out analyses to reduce errors in the model, especially those caused by the spatial autocorrelation of presence data or the multi-collinearity of the environmental predictors used. This paper proposes statistical methods to solve drawbacks frequently disregarded when such models are built. We use spatial records of 3 species characteristic of the Mexican humid mountain forest and 2 sets of original variables. The selection of presence-only records with no autocorrelation was made by applying both randomness and pattern analyses. Through principal component analysis (PCA) the 2 sets of original variables were transformed into 4 different sets to produce the species distribution models with the modeling application in Maxent. Model precision was higher than 90% applying a binomial test and was always higher than 0.9 with the area under the curve (AUC) and with the partial receiver operating characteristic (ROC). The results show that the records selected with the randomness method proposed here and the use of the PCA to select the environmental predictors generated more parsimonious predictive models, with a precision higher than 95%, and in addition, the response variables show no spatial autocorrelation. Key words: randomness test, pattern analysis, spatial autocorrelation. Resumen. Cuando se modela la distribución potencial de una especie es deseable efectuar algunos análisis previos para reducir errores en el modelo resultante, especialmente los ocasionados por la autocorrelación espacial de los registros de presencia y la correlación entre los predictores ambientales utilizados. En este trabajo se proponen métodos estadísticos que sirven para resolver estos inconvenientes que con frecuencia se presentan al elaborar los modelos de distribución potencial. Se emplearon los registros de presencia de 3 especies características del bosque húmedo de montaña de México y 2 conjuntos de variables originales. A los datos de presencia se les aplicó un análisis de aleatoriedad y de patrones para seleccionar registros no autocorrelacionados. Mediante análisis de componentes principales (PCA), los 2 conjuntos de variables originales se transformaron en 4 conjuntos distintos para generar los modelos de distribución de especies utilizando el algoritmo Maxent. La precisión de los modelos fue mayor al 90% con una prueba binomial y mayor de 0.9 del área bajo la curva (AUC) con la característica operativa del receptor parcial (ROC). Los resultados muestran que la selección de registros por el método de aleatoriedad propuesto y el uso de componentes principales como predictores ambientales generan modelos predictivos más parsimoniosos, con una precisión mayor al 95%, además de que sus variables predictivas no presentan autocorrelación espacial. Palabras clave: prueba de aleatoriedad, análisis de patrón, autocorrelación espacial. Introduction The models that predict the potential distribution of species through the combination of presence-only records Recibido: 02 abril 2013; aceptado: 07 octubre 2013 and digital layers of environmental variables are of great interest in both theoretical and applied disciplines (Guisan and Thuiller, 2005; Elith and Leathwick, 2009a; Peterson et al., 2011). Such models use the association between environmental variables, presumably of predictive value, and the species occurrence records; thus are identified 190 Cruz-Cárdenas et al.- Distribution modeling and PCA as predictor variables the environmental conditions where a species could survive indefinitely (Pulliam, 2000; Guisan and Thuiller, 2005; Elith and Leathwick, 2009b). This approach is especially important to produce basic information for such disciplines as biogeography, conservation biology, ecology, evolutionary biology, and others (Guisan and Zimmermann, 2000; Guisan and Thuiller, 2005; Elith and Leathwick, 2009b; Peterson et al., 2011). Species distribution models implicitly suppose that the geographical data points for species records are independent, although this is not necessarily true. In addition, the environmental layers used as hypothetical predictive variables and associated to the geographical records of species also show problems of spatial autocorrelation. The spatial autocorrelation is the degree of dependency of variables in geographical space (Cressie, 1991; Legendre, 1993; Anselin et al., 2004); accordingly, disparity among variable values is strongly influenced by the distances among geographical data points where a species has been observed (Anselin et al., 2004; Segurado et al., 2006). Spatial autocorrelation represents an intrinsic characteristic in most of the geospatial data (Legendre, 1993; Segurado et al., 2006) and it can be an important bias in most geospatial analyses (Anselin et al., 2004). Spatial autocorrelation inflates type I errors of traditional statistics and it can affect the estimated parameters in model selection (Lennon, 2002). The species distribution models obtained from a large data set of associated environmental covariates often inherently result in multi-collinearity, a statistical problem defined as a high degree of correlation among covariates. Multi-collinearity is a serious statistical problem in nonexperimental situations, where the researcher has no control of the risk associated to hypothetical factors related to independent variables. Multi-collinearity is found, for instance, when many covariates are used as predictor variables to model selection and several of them measure similar phenomena. This is so because in most cases the researcher does not have a priori knowledge on which predictive environmental variables should be included in the model. However, the researcher must have a model in mind that usually includes a large number of predictive variables and hopes that using an appropriate statistical analysis will provide him/her with a correct model. It should be taken into account that multi-collinearity does not violate the assumptions that underlie to the statistical analysis, i.e., its presence does not affect the estimate of the dependent variable. In other words, estimation values for the dependent variable are the best unbiased estimates from the conditional population average. However, the existence of multi-collinearity tends to inflate both the variances of predicted values of the response variable and the variances of the estimated parameters. Therefore, if one considers that multi-collinearity is present in a dataset, it is important to know how the linear relationships are among the predictive environmental variables. For these reasons, it is critical both to the researcher as to the research to be sure that those environmental predictive variables are orthogonal to each other, that is, they are mutually independent. Species distribution models are not explicitly spatial (Franklin, 2009); they suppose that the geographical occurrences of records are mutually independent. However, this violates a fundamental principle of the spatial geography establishing that spatially proximate objects are similar and proximate localities tend to have similar values due to the possibility they reciprocally influence each other, or both are influenced by the same pattern that generates geographical processes (Franklin, 2009). Disregard and not avoid spatial autocorrelation has consequences, for example: a) it can increase the probability of incurring in type I errors or incorrectly rejecting the null hypothesis of no effect, b) variable selection may be predisposed toward more strongly auto-correlated predictors (Lennon, 2002), c) coarse scale predictors may be better selected against more locally influencing predictors, and d) model selection based on the Akaike information criterion will tend to model with larger number of predictive variables due to the committed residual variance structure. In summary, if spatial autocorrelation is present and ignored or not resolved, one may be incurring in a biased selection of variables or model coefficients. Among the statistical procedures proposed to solve or to reduce autocorrelation, principal component analysis (PCA), ridge-regression, and latent-root regression have been mentioned (Mason and Gunst, 1985; Afifi et al., 2012). The advantages of PCA compared with the other 2 procedures is the availability of an exact theory on estimate distributions, that is, the term or the error of the regression and the estimates are normally distributed (Gunst and Mason, 1977) and the principal components (PCs) are useful exploratory tools to detect and quantify mutual relationships among variables (Afifi et al., 2012). The reduction of dimensionality is among the many applications of the PCA, that is, the reduction to a number of predictive variables that retain a high proportion of the original information (Tabachnick and Fidell, 2007). The PCs obtained are placed hierarchically according to their variance size; consequently, the first PC explains the maximum variance recorded in the predictive variables, the second PC explains the maximum of the residual variance and so forth, until the last PC which explains the remainder variance (Tabachnick and Fidell, 2007). Since the first PCs are those that retain the highest proportion Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 of information, the dimensionality reduction is attained by choosing those first PCs, which explain a high percentage of variation recorded in the original data. The selection of the number of PCs is a function of the variance percentage that satisfies the standards of the research carried out. Another important property of obtained PCs is their independency of each other, that is, they are orthogonal. Accordingly, obtaining the PCs permits them to be used as independent, non-correlated variables in analyses of modeling potential species distribution. This contribution seeks to evaluate and propose statistical methods to resolve or minimize spatial autocorrelation, from both presence species records and environmental variables used in species distribution modeling. The methods proposed are then tested to model the potential distribution of 3 species characteristic of the humid mountain forest of Mexico (Cruz-Cárdenas et al., 2012). A contrasting number of environmental variables and number of presence records are used to validate the resulting models and to select the best one for each species using as criteria the reduction of spatial autocorrelation, the precision of the predicted potential distribution, the parsimony principle, and the surface of simulated occurrence. Materials and methods Predictive environmental variables. Table 1 shows the 58 presumably predictive environmental variables (pixel size 1 km2) used in the analysis, which include: 1) 19 climatic variables taken from WorldClim data base (Hijmans et al., 2005; http://www.worldclim.org/current.htm); 2) 7 seasonal climatic variables calculated from WorldClim data; 3) 9 variables of soil properties (Cruz-Cárdenas et al., 2014); 4) 8 topographic variables generated from a digital elevation model extracted from the GTOPO data base (http://edc.usgs.gov/products/elevation/gtopo30/gtopo30. html), and 5) 14 normalized vegetation indices, one for each month of the year 2009 obtained from remote sensing data (MODIS), and one normalized vegetation for the dry months, and another for the humid months. Considering studies by García (1965) and Mociño and García (1974), we selected from WorldClim data those variables that account for the distribution of the vegetation types in Mexico. For instance, based on conclusions by these authors, the distribution patterns of rainfall and temperature in Mexico are strongly influenced by orography, and by the atmospheric circulation at both low and high elevations. Rainfall in Mexico is heterogeneous throughout the year; it increases latitudinally from north to south, and is also strongly influenced by the presence of the Gulf of Mexico and the Pacific Ocean (García, 1965), 191 and by the link with the El Niño Southern Oscillation (Cavazos and Hastenrath, 1990). García (1965) and Mociño and García (1974) point out that about 70% of the rainfall in Mexico is recorded from May to October, and the remaining 30% between November and April, which define the humid and dry seasons, respectively. During the dry season, a proportion of species lose their foliage, producing physiognomic differences that characterize the different vegetation types of Mexico. The presence or absence of leaves during the humid or dry seasons should be reflected on the monthly normalized vegetation indices. Based on the works of the above authors, we made an a priori selection from the 58 variables listed in Table 1 discarding 38 and retaining 20 variables that presumably influence the potential distribution of plant species. These 20 variables are marked in Table 1 under the header a priori. Species presence records. Three species considered characteristic of the humid mountain forest of Mexico (Villaseñor, 2010) were selected for analysis: Liquidambar styraciflua L., Quercus rubramenta Trel., and Roldana robinsoniana (Greenm.) H. Rob. and Brettell. These species were selected based on the criteria by CruzCárdenas et al. (2012) in which the species are restricted to the HMF geographic polygon besides that the literature has also referred them as charismatic of that biome. The localities of occurrence were obtained from specimens housed in the Herbario Nacional de México (MEXU) at Instituto de Biología, UNAM. Data for 193 collecting localities were transformed to geographical coordinates, 142 of them represented L. styraciflua, 41 Q. rubramenta, and 10 R. robinsoniana. Species distribution modeling. Figure 1 illustrates the procedure to generate the species distribution models. Below, it is described in detail: 1) A randomness test was applied to the spatial records for each species (Bivand et al., 2008). If positive, 75% of the records were used for training the model and 25% for model validation. If negative, a pattern analysis was applied to the records by estimating the distance at which it is possible to find a single species record with a maximum probability. The pattern analysis was performed by using the public domain ILWIS 3.7 (http://52north.org/ilwis). Pattern analysis is similar to estimating the distance (or the range value of a variogram) for which the records do not show spatial autocorrelation (Hengl, 2007). The study area is divided in a grid cell system whose sides are the estimated distance value expressed in degrees; such a grid cells system is obtained by using the public Quantum GIS 1.7.4 software (http://qgis.osgeo.org). Finally, after the species records are randomly selected, a single record per grid cell is used to train the model. 192 Cruz-Cárdenas et al.- Distribution modeling and PCA as predictor variables Table 1. Original environmental predictive variables used in species distribution modeling. In bold, the a priori selected variables are indicated A priori selected CO= Organic carbon a) Climatic variables bio1= Annual mean temperature A priori selected X bio2= Mean diurnal range (Tmax - Tmin) K= Potasium MO= Organic material bio3= Isothermality (bio1/bio7) × 100 X Mg= Magnesium bio4= Temperature seasonality (standard deviation × 100) X Na= Sodium pH= Hydrogen potential bio5= Maximum temperature of warmest month RAS= Sodium absorption relationship bio6= Minimum temperature of coldest month d) Topografic attributes bio7= Temperature anual range (bio5–bio6) Aspect bio8= Mean temperature of wettest quarter Anisotropic heating bio9= Mean temperature of driest quarter Elevation bio10= Mean temperature of warmest quarter Runoff bio11= Mean temperature of coldest quarter bio12= Annual precipitation X X Topographic humidity index Terrain rugosity index bio14= Precipitation of driest month Vector rugosity measurement X Slope bio16= Precipitation of wettest quarter e) Normalized vegetation indices for 2009 bio17= Precipitation of driest quarter IVNENE = January normalized index bio18= Precipitation of warmest quarter IVNFEB = February normalized index bio19= Precipitation of coldest quarter IVNMAR = March normalized index b) Seasonality climatic variables X IVNABR = April normalized index ETRA= Real annual evapotranspiration X IVNMAY = May normalized index ETRAH= Real evapotranspiration of the humid months (may to october) X IVNJUN = June normalized index ETRAS Real evapotranspiration of the dry months (november to abril) X PPH= Precipitation of the humid months X IVNSEP = September normalized index PPS= Precipitation of the dry months X IVNOCT = October normalized index TH= Mean temperature of the humid months X IVNNOV = November normalized index TS= Mean temperature of the dry months X IVNDIC = December normalized index c) Soil properties Ca= Calcium CE= Electric conductivity X Convergence index X bio13= Precipitation of wettest month bio15= Precipitation seasonality X X IVNJUL = July normalized index IVNAGO = August normalized index IVNH = Humid months of year normalized index X IVNS = Dry months of year normalized index X Evapotranspiration was calculated using the Turc’s model (Turc, 1954): ETRA= P / [0.9 + (P/L)2]½, where, P= total annual precipitation (mm), L= 300 + 25T +0.05T3, and T= average annual temperature (° C). 2) The environmental layers were 7 principal components (PCs) obtained from the PCA for 20 environmental variables selected a priori for Mexico (Table 1). Such PCs explained more than 95% of variance of the original variables (Set I). 3) The 7 PCs and the training records were used for modeling the potential species distribution with the Maxent algorithm (http://www.cs.princeton.edu~schapire/ maxent/). Maxent requires presence-only records and a set of predictive environmental variables (Phillips et al., 2006); Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 Figure 1. Diagram indicating the proposed methodology used in this work to generate the species distribution models. because it is a nonparametric algorithm, it incorporates interactions among variables, produces continuous maps of suitability or compatibility (Phillips et al., 2006), and performs better than other methods that estimate potential species distributions with presence-only data (Elith et al., 2006). Maxent configuration was used by default (Phillips and Dudik, 2008), except for the “Extrapolate” and “Do clamping” modules that were deactivated; the output format was logistic. 4) Models obtained with Maxent were transformed to boolean layers (presence-absence) with a cutoff threshold equal to 10% omission errors (Pearson et al., 2007). 5) A binomial test was used for model validation to assess whether it was better than any other model randomly obtained (p > 0.5). The number of successes was obtained quantifying the number of records with logistic values above the cutoff threshold. 6) A partial Receiver Operating Characteristic (ROC) test was applied as an additional precision analysis (Peterson et al., 2008). This test estimates the relationship between the Area Under the Curve (AUC) and the null expectation of bootstrap repetitions; a Z test is the decision rule to determine if such a relationship is less than 1 (considered a good model). Additionally, the number of bootstrap replicas with AUC≤ 1 was counted. As a more appropriate alternative, we used partial ROC approaches 193 that weight omission error over commission error, such that the model evaluation is more appropriate to the challenge of predicting species’ distributions from incomplete, presence-only biodiversity data (Peterson et al., 2008). Species distribution models were also obtained with other 5 sets of environmental variables. The second set (Set II) consisted of 7 PCs that explained 95% or more of data variance and obtained through the PCA applied to 58 environmental variables (Table 1). The third set (Set III) included 7 PCs obtained with the nonspatial data for 20 variables selected a priori (Table 1). The fourth set (Set IV) was prepared with 58 variables and nonspatial data, but the PCA was elaborated from a data matrix with records selected with step 1 (above described) as rows and the environmental variables as columns, that is, the matrix consisted of records selected for training the model with step 1 as rows and the environmental variables as columns. All PCA analyses were carried out with R software (R Development Core Team, 2012). In each case, the variables selected were significantly correlated with the PC (95% confidence level). Thus, the linear combinations were made with the loading values of the selected variables, and for each of the 7 PCs their respective rasters were created. The sets V (20 variables) and VI (58 variables) included the original environmental predictors. Steps 3 (species distribution modeling), 4 (boolean layer built-up), 5 (binomial test), and 6 (partial ROC test) of the proposed methodology were similar for the 6 sets of environmental predictors. In brief, environmental predictors integrating sets I to III are 7 PCs generated with 20 original variables, sets II and IV are likewise integrated by 7 PCs generated with 58 original variables, and the sets V and VI included 20 and 58 original variables respectively (Table 1). Results Figure 2 shows the results of the randomness test applied to the records for each species. The analyses indicate that the records for the 3 species show aggregated spatial patterns, since the observed values (continuous line) are distributed up and out of the confidence bands estimated to any distance (r). Records of Q. rubramenta (Fig. 2B) display a strong aggregation since at distances lower than 6.5 km (0.06 degrees) the probability of finding a record is high (G(r) ≈ 0.9). In contrast, the records of R. robinsoniana (Fig. 2C) show a weak aggregation because at distances of about 77 km (< 0.4 degrees), the probability of finding a record is lower (G(r) ≈ 0.7). On the other hand, records of L. styraciflua (Fig. 2A) display an aggregated distribution at short distances and a random distribution pattern at intermediate distances, with a maximum 194 Cruz-Cárdenas et al.- Distribution modeling and PCA as predictor variables Figure 2. Randomness test for recorded sites of Liquidambar styraciflua (A), Quercus rubramenta (B), and Roldana robinsoniana (C). G(r)= average number of records inside a radius r, equivalent to distance in degrees. The continuous line corresponds to observed values, the discontinuous line to the theoretical values, and the gray area to the confidence band. probability (G(r) ≈ 0.9) of finding a record at distances of about 16 km (G(r) ≈ 0.15). Figure 3 shows the pattern analyses for the records of the 3 species. Based on these analyses, cutoff distances resulted with maximum likelihood of species randomly distributed were 1.5, 0.3, and 3.9 degrees for L. styraciflua, Q. rubramenta, and R. robinsoniana respectively. Records selected by this process were 16 for L. styraciflua, 7 for Q. rubramenta, and 5 for R. robinsoniana, which when applied, satisfied their randomness test. However, it is important to point out that the records selected with this method for R. robinsoniana were only 2; therefore 3 additional records were randomly selected to reach the minimum number of records (n= 5). Figure 3 (D, Figure 3. Pattern analyses applied to species records: Liquidambar styraciflua (A), Quercus rubramenta (B), Roldana robinsoniana (C). Ramdomness test to selected records of these species for model training: Liquidambar styraciflua (D), Quercus rubramenta (E), Roldana robinsoniana (F). E y F) shows the observed values distributed into the confidence bands, which ensures the randomly distribution pattern of records for the 3 species. Finally, the number of records used to validate the distribution models were 35 for L. styraciflua, 10 for Q. rubramenta, and 5 for R. robinsoniana. Figure 4 shows the species distribution models and the areas of occupancy predicted by the models. Predicted areas are low, intermediate, and high. Models that predicted low geographical areas were those that used 20 and 58 variables as environmental predictors corresponding to sets V y VI (Fig. 4). In contrast, models that predict wide geographical areas were those whose simulations derived from the use of 7 PC, that is, set I to IV (Fig. 4). Set I predicted the smaller area as compared with the others. The relative contribution of environmental predictors, represented by the orthogonal arrangement of PCs were: PC1 and PC4 for L. styraciflua (Table 2) that explain Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 Figure 4. Species distribution models for Liquidambar styraciflua (row 1), Quercus rubramenta (row 2), and Roldana robinsoniana (row 3). Roman numbers in columns (I-VI) correspond to the set of environmental predictors used in species distribution modeling. 195 196 Cruz-Cárdenas et al.- Distribution modeling and PCA as predictor variables Table 2. Number of principal components retained and their relative contribution to modeling distribution of species resulting from Maxent outputs. The most important principal components for each species are highlighted in bold face Environmental predictor Liquidambar styraciflua Quercus rubramenta Roldana robinsoniana PC1 40.4 57 66.9 PC2 17.8 8.9 7 PC3 9.5 29.9 22.2 PC4 30.6 0.3 3.2 PC5 1.5 0 0.7 PC6 0 0 0 PC7 0.2 3.9 0 71% of total variance registered in the original variables, and PC1 and PC3 both for Q. rubramenta and for R. robinsoniana, which contributed to explain 86.9% and 89.1%, respectively. The most important original variables included in the PCs of each species were selected based on their highest loading values (Table 3). Elevation, Table 3. Loading factors of the predictive environmental variables used in the principal component analysis Original variable PC1 PC3 PC4 Aspect 0.311 −0.241 0.182 bio1 0.204 0.251 0.121 bio3 0.210 0.346 0.516 bio4 0.156 0.068 0.002 bio12 0.162 −0.24 −0.192 bio15 0.108 0.005 −0.069 Soil electric conductivity 0.107 −0.16 −0.31 Elevation 0.323 −0.268 0.134 ETRA 0.278 −0.192 0.231 ETRAH 0.186 −0.021 0.014 ETRAS IVNH 0.2 0.036 0.066 0.111 −0.166 −0.307 IVNS 0.356 0.196 −0.132 Soil organic material 0.337 −0.147 −0.117 Slope 0.197 0.117 0.206 Soil pH 0.122 −0.257 −0.193 PPH 0.2 0.05 −0.186 PPS 0.159 0.614 −0.47 TH 0.258 0.119 −0.005 TS 0.223 0 −0.11 PC= Number of principal components from PCA with set I of variables. The acronyms at first column are indicated in Table 1. Highest loading values are highlighted in bold face. normalized vegetation index of dry months, organic matter, isothermality, and rainfall for the dry months of the year turned out to be the most important variables to explain the potential distribution for the 3 species. The first 3 variables were used to generate the environmental space (Fig. 5) of the occurrence of the 3 species in the humid mountain forest of Mexico (Cruz-Cárdenas et al., 2012). The binomial test determined that all species distribution models were better that any others randomly obtained (p > 0.5). The precision of generated models for L. styraciflua with the set I of environmental layers (97%), set II (90%), and set IV (94%) were not statistically different; precision of these 3 sets was higher than that of sets III (77%), V (68%), and VI (88%). On the other hand, precision of distribution models for Q. rubramenta and R. robinsoniana with all the 6 sets of environmental predictors were statistically similar because they included 100% of success with validation records. Discussion The evaluation of spatial distribution patterns of species records is important to avoid errors caused by spatial autocorrelation (Dormann, 2007). If the researcher does not avoid or minimize the spatial autocorrelation before the selection of variables, there may be negative consequences in the modeling analysis among them, and perhaps the most important, is the skewed selection of the environmental variables assumed as hypothetically predictive (Lennon, 2002). Our results show that randomness and spatial pattern analyses are good statistical methods to eliminate or minimize spatial autocorrelation shown in the species records. Following Phillips et al. (2006), relative contribution of environmental variables to species distribution should be taken carefully, especially if we are not certain that these variables are or not spatially correlated. In this respect, the strength of this contribution is that the environmental variables submitted to the PCA, Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 Figure 5. Environmental space of the 3 species built with the organic matter (%), normalized vegetation index (INV), and altitude (m) variables. Liquidambar styraciflua (A), Quercus rubramenta (B), Roldana robinsoniana (C). In black it is indicated the environmental space of the species and the gray color shows the environmental space of the humid mountain forest in Mexico. 197 resulted in PCs that constitute orthogonal projections of the transformed variables and consequently free of autocorrelation. The models of distribution generated with raw environmental predictors generally predict a smaller occupancy surface. That is so, because these models must satisfy a large number of rules, so that, a given record can be classified as a true presence. It is evident that predicted occupancy surface may have problems when hypothetical environmental variables are correlated, as is the case of those climatic variables mostly generated from rainfall and temperature data. The autocorrelation problem was notably reduced by using PCA, especially due to the property of orthogonal transformation of variables. As a result, each PC is a linear combination of the original predictive variables such that the new variables decrease in hierarchical order allowing the selection of a reduced number of PCs that explain a high percentage of variance of the variables. PCA used along with the highest values of the loading factors is not the single criterion for the selection of variables; others include the grouping criterion and the combination of correlation and the highest average of the loading factors (Al-Kandari and Jolliffe, 2012). Distribution models obtained for the 3 species and their predicted occupancy area was consistently higher when sets of variables II and IV were used (except for L. styraciflua). This is because the environmental layers used are influenced by the linear combinations from which they were generated. Moreover, models generated for L. styraciflua and R. robinsoniana with sets I and II (resulted with) have the same statistical precision but are different from those obtained with sets III and IV. With the exception of the distribution model for R. robinsoniana, the use of set II predicts the species distribution in the Yucatán Peninsula; such projections are incorrect since there is no evidence that the humid mountain forest ever existed in that region. Our results show that distribution models obtained with the set of predictors number I were the most parsimonious and suitable; their predicted variables lack spatial autocorrelation and although the predicted surface area includes patches of L. styraciflua extending into the Yucatán Peninsula (mostly south of Campeche and Quintana Roo), precision was always higher that 95%. The predicted occurrence of species in the Yucatán Peninsula perhaps may be explained by historical factors (Luna-Vega and Magallón, 2010); maybe its presence was factual in remote times when climate was mild and later the species reduced their geographical distribution due to recent climatic fluctuations. The 3 selected species used in the distribution modeling 198 Cruz-Cárdenas et al.- Distribution modeling and PCA as predictor variables analyses are preferably distributed in the humid mountain forest of Mexico, although they are not strictly restricted to this biome. Accordingly, the environmental space of this biome determines the realized niches of these 3 species, since it represents the suitable environmental, ecological, and biological conditions where they would be able to coexist indefinitely. In addition, the environmental space given by the range of tolerances to the environmental combinations that assure species long-term survival configure their fundamental niches. In consequence, it is evident that the fundamental niche of the 3 species is wider than their realized niche, such as it is expected (Soberón and Peterson, 2011). The results obtained by using the randomness and spatial pattern analyses reduced or eliminated autocorrelation among species records, and suggest that they are adequate statistical methods to solve that problem. On the other hand, the use of PCA and variable selection based on the highest loading factors of each PC guarantee that this diminishes the autocorrelation of variables before modeling the distribution of species. The validation of the resulting models as well as the analyses of the statistical similarities or differences among them demonstrate their relevance to help us to select the best model. In this respect, it was decided that the best models were those that met the principle of parsimony, namely the simplest, with the smaller number of explanatory variables, with either statistical or ecological significance, and with maximum feasibility of occupancy surface area. As a corollary, before modeling of both real and potential species distribution the researcher must decide whether the suggestions proposed in this paper should be put into practice. Acknowledgments We thank Pedro D. Maeda and Francisco Espinosa for the reading of a preliminary version of this paper and their suggestions to improve the manuscript. Claudio Delgadillo read the English version of the manuscript and his suggestions were of great help to improve it. Joselin Cadena redrew Figure 1. The first author thanks the UNAM for the fellowship granted in its DGAPA (Dirección General de Asuntos del Personal Académico) 2011 Postdoctoral Fellowships Program. The Institute of Biology, UNAM provided human and material resources to carry out this research. Literature cited Afifi, A., S. May and V. A. Clark. 2012. Practical multivariate analysis (Fifth edition). CRC Press, Taylor and Francis Group, Boca Raton. 517 p. Al-Kandari, N. M. and I. T. Jolliffe. 2012. Variable selection and interpretation of covariance principal components. Communications in Statistics-Simulation and Computation 30:339-354. Anselin, L., R. G. Bongiovanni and J. Lowenberg-DeBoer. 2004. A spatial econometric approach to the economics of siteSpecific nitrogen management in corn production. American Journal of Agricultural Economics 86:675-687. Bivand, R., E. Pebesma and V. Gómez-Rubio. 2008. Applied spatial data analysis with R. Spring, New York. 378 p. Cavazos, T. and S. Hastenrath. 1990. Convection and rainfall over Mexico and their modulation by the southern oscillation. International Journal of Climatology 10:377-386. Cressie, N. 1991. Statistics for spatial data. John Wiley and Sons. New York. 920 p. Cruz-Cárdenas, G., J. L. Villaseñor, L. López-Mata and E. Ortiz. 2012. Potential distribution of Humid Mountain Forest in Mexico. Botanical Sciences 90:331-340. Cruz-Cárdenas, G., L. López-Mata, C. A. Ortiz-Solorio, J. L. Villaseñor and E. Ortiz. 2014. Interpolation of Mexican soil properties at a scale of 1:1,000,000. Geoderma 213:29-35. Dormann, C. 2007. Effects of incorporating spatial autocorrelation into the analysis of species distribution data. Global Ecology and Biogeography 16:129-138. Elith, J., H. C. Graham, P. R. Anderson, M. Dudík, S. Ferrier, A. Guisan, R. J. Hijmans, F. Huettmann, J. R. Leathwick, A. Lehmann, J. Li, L. G. Lohmann, B. A. Loiselle, G. Manion, C. Moritz, M. Nakamura, Y. Nakazawa, J. McC. Overton, A. T. Peterson, S. J. Phillips, K. Richardson, R. ScachettiPereira, R. E. Schapire, J. Soberón, S. Williams, M. S. Wisz and N. E. Zimmermann. 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29:129-151. Elith, J. and J. R. Leathwick. 2009a. The contribution of species distribution modelling to conservation prioritization. In Spatial conservation prioritization: quantitative methods and computational tools, A. Moilanen, K.A. Wilson and H.P. Possingham (eds.). Oxford University Press. Oxford and New York. 304 p. Elith, J. and J. R. Leathwick. 2009b. Species distribution models: Ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution and Systematics 40:677-697. Franklin, J. 2009. Mapping species distributions, spatial inference and prediction. Cambridge University Press. Cambridge. 320 p. García, E. 1965. Distribución de la precipitación en la República Mexicana. Publicaciones del Instituto de Geografía 1:171-191. Guisan, A. and N. E. Zimmermann. 2000. Predictive habitat distribution models in ecology. Ecological Modeling 135:147-186. Guisan, A. and W. Thuiller. 2005. Predicting species distribution: offering more than simple habitat models. Ecology Letters 8:993-1009 Revista Mexicana de Biodiversidad 85: 189-199, 2014 DOI: 10.7550/rmb.36723 Gunst, R. F. and R. L. Mason. 1977. Biased estimation in regression: an evaluation using mean squared error. Journal of the American Statistical Association 72:616-628. Hengl, T. 2007. A practical guide to geostatistical mapping of environmental variables. European Commission, Joint Research Centre, Institute for Environment and Sustainability. Italy. 143 p. Hijmans, R. J., S. E. Cameron, P. G. Parra and A. Jarvis. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25:1965-1978. Legendre, P. 1993. Spatial autocorrelation: problem or new paradigm. Ecology 74:1659-1673. Lennon, J. J. 2002. Red shifts and red herrings in geographical ecology. Ecography 23:101-113. Luna-Vega, I. and S. Magallón. 2010. Phylogenetic composition of angiosperm diversity in the cloud forests of Mexico. Biotropica 42:444-454. Mason, R. L. and R. F. Gunst. 1985. Selection principal components in regression. Statistics and Probability Letters 3:299-301. Mociño, P. A. and E. García. 1974. The climate of Mexico. In Climates of North America, R. A. Byrson and F. K. Hare (eds.). Elsevier. London. p. 345-404. Pearson, R. G., C. J. Raxworthy, M. Nakamura and A. T. Peterson. 2007. Predicting species distribution from small numbers of occurrence records: a test case using cryptic geckos in Madagascar. Journal of Biogeography 34:102-117. Peterson, A. T., M. Papes and J. Soberón. 2008. Rethinking receiver operating characteristic analysis. Ecological Modelling 213:63-72. 199 Peterson, A. T., J. Soberón, R. G. Pearson, R. P. Anderson, E. Martínez-Meyer, M. Nakamura and M. B. Araújo. 2011. Ecological niches and geographical distributions. Monographs in Population Biology 49. Princeton University Press, Princeton, New Jersey. 328 p. Phillips, S. J., R. Anderson and R. Schapire. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190:231-259. Phillips, S. J. and M. Dudik. 2008. Modeling of species distributions with MaxEnt: new extensions and a comprehensive evaluation. Ecography 31:161-175. Pulliam, H. R. 2000. On the relationship between niche and distribution. Ecology Letters 3:349-361. R Development Core Team. 2012. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Segurado, P., M. B. Araújo and W. E. Kunin. 2006. Consequences of spatial autocorrelation for niche-based models. Journal of Applied Ecology 43:433-444. Soberón, J. and A. T. Peterson. 2011. Ecological niche shifts and environmental space anisotropy: a cautionary note. Revista Mexicana de Biodiversidad 82:1348-1355. Tabachnick, B. G. and L. S. Fidell. 2007. Using Multivariate Statistics. 5th ed. Pearson Education, Boston. 980 p. Turc, L. 1954. Le bilan d’eau des sols: relations entre les precipitation, l’évaporation et l’écoulement. Annales Agronomiques 5:491-596. Villaseñor, J. L. 2010. El bosque húmedo de montaña en México y sus plantas vasculares: Catálogo florístico-taxonómico. Comisión Nacional para el Conocimiento y Uso de la Biodiversidad. Mexico, D. F. 40 p.

Log In

Potential species distribution modeling and the use of principal component analysis as predictor variables

Potential species distribution modeling and the use of principal component analysis as predictor variables

Related Papers

RELATED PAPERS