Due: 5pm Mon, Feb 23 2015 via GauchoSpace

Introduction

In this lab, you’ll predict the distribution of a species using the software package Maxent, short for “maximum entropy” – the statistical technique used to fit a model differentiating species observations from the background environmental data. You’ll use “presence” points from the Global Biodiversity Information Facility (GBIF), and bioclimatic environmental predictors from WorldClim.org.

Here’s a figure of the overall process:

1 Species Response Data from GBIF

1.1 Choose Species

Choose a species from this Species List and enter your name in the student column so we all do different species. This list was derived from the species listed in tables 24 and 27 from Santa Barbara County’s Burton Mesa Ecological Reserve Final Land Management Plan and Environmental Impact Report where at least 100 occurence records are found for Santa Barbara County’s extent (longitude: -121 to -119; latitude: 34 to 36) in the Global Biodiversity Information Facility GBIF.org.

Set your working directory (wd) and scientific species name (sp_scientific) of your chosen species (Genus species only) in the set_vars R chunk below.

wd = 'H:/esm215/lab6_species'
sp_scientific = 'Amphispiza belli'

1.2 Fetch GBIF Occurrences (automatic)

The next R chunk fetches the first 1000 occurrences from GBIF using the rgbif R package.

1.3 Filter by Bounding Box (interactive)

You’ll want to restrict observations to those in the Americas, so we can later predict to Santa Barbara County. To do this, you’re provided with an interactive map to draw a bounding box extent around the points of interest.

For the interactive drawing to work, you’ll want to go to upper right RStudio menu Chunks -> Run All so the R code is run from the Console. (Note: Knitting the R Markdown document is not interactive.)

The bounding box extent is saved to spp/*_extent.csv which is read in next time the code runs. (The interactive drawing of bounding box is presented if the file’s not found, or defaults to global extent if not run interactively.).

plot of chunk draw_bboxplot of chunk draw_bbox

If only a few points are within the Americas (and most in other continents), please update Species List and the set_vars R chunk with a new species selection, delete the spp folder, go to menu Chunks -> Run All again.

1.4 Partition Points into Training and Test (automatic)

The next R chunk:

  1. Filters the GBIF observation points based on the drawn extent from the previous step, and

  2. Partitions these points randomly into:

  1. train for model fitting (80% of filtered points), and

  2. test for model evaluation (20%).

  1. Plots train (red) and test (blue) points onto a map.

plot of chunk partition_plot_ptsplot of chunk partition_plot_pts

Question: Were there any GBIF observations for your species that you spatially filtered to achieve a study area restricted to the Americas? If observations were filtered, do you speculate that the populations elsewhere are distinct (ie have no demographic interactions) with the one(s) in the Americas?

2 Environmental Predictor Data from WorldClim

For this lab, you’ll be using environmental predictors from the WorldClim.org database. This database provides current, past and future climatic variables such as monthly mean/min/max temperature and precipitation (which are multiplied by 10 in order to store as integers). These monthly variables have been combined into 19 biologically relevant climatic variables (bioclim) plus altitude:

We’ll use the variables calculated for the following time periods:

You’ll proceed with the following scenarios which distinguish between the time (current vs future) and space (global, cropped or Santa Barbara) of chosen environmental predictors (all or top) in which the model is fitted versus predicted (sequentially novel aspect of scenario bolded):

scenario time, fitted space, fitted predictors time, predicted space, predicted
scenario_01 current global all current global
scenario_02 current cropped all current cropped
scenario_03 current cropped top current cropped
scenario_04 current cropped top current Santa Barbara
scenario_05 current cropped top future Santa Barbara

The “global” predictors you’ll be using have a cell size of 10 minutes (18.55 km at equator), from which “cropped” is limited to the geographic extent of the point observations. Then you’ll pick the top 3 environmental predictors plus altitude (“top 3 bio + alt”) based on percent contribution. Next, you’ll use this fitted model for predicting to a “Santa Barbara” extent using WorldClim predictors having a cell size of 0.5 minutes (0.93 km at equator).

Although we could theoretically use the R function getData from the raster library to fetch all these WorldClim data on the fly, the server is slow and inconsistent with responding. So these data have been provided for you.

The following R chunk creates the Maxent output scenario folders.

The next chunk of R code crops the global data to the extent of train and test data points, leaving you with the following predictor combinations of time and space in the env folder that you’ll use in the Maxent scenarios in the following order:

  1. current_10min_global
  2. current_10min_cropped
  3. current_0.5min_sb
  4. future_0.5min_sb

2.1 All Predictors to Current Global [scenario_01]

In order to capture the maximum possible extent of the species, you’ll start with the global predictors.

plot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-globalplot of chunk sc01_plot_env-current-global

Note that these global environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by commenting out (ie precede the line with a # character) the plot command in the R chunk above.

2.1.1 Run Maxent

Now that you have your biological response data (in spp folder) and environmental predictor data (in env folder), you’re ready to fit, predict and evaluate a species distribution model with Maxent.

Double click on software\maxent.bat to launch Maxent’s graphical user interface. This batch file simply opens the Maxent Java archive (*.jar) with the Java engine using 6000 MB of memory: java -mx6000m -jar maxent.jar. There’s a non-fatal warning based on a slight mismatch between Java versions that you can safely ignore (“WARNING: Could not open/create prefs root node Softwareat root 0 x80000002. Windows RegCreateKeyEx(…) returned error code 5.”)

Configure scenario_01 with the following basic settings (where * represents your species):

  • Samples: browse to spp\*_train.csv. This sets the input species presence points for which Maxent extracts the environmental layer data and differentiates from the “background” (ie all other environmental data in the environmental layer extent.)

  • Environmental layers: browse to env\current_10min_global. This sets the folder of environmental data and automatically selects all valid rasters in the folder, which must have the same resolution and extent as each other.

  • Output directory: browse to scenario_01. This is where Maxent places all the output results from fitting, predicting and evaluating the model.

  • Output file type: bil. The band interleaved by line (.bil) format is simply a more compact in file size than the default ASCII text raster format (.asc).

  • Create response curves tick on. These response curves describe the statistical relationship between environmental predictor and species response.

  • Settings button to launch interface to more parameters

  • in the Basic tab, Test sample file: browse to spp\*_test.csv. This adds an extra set of evaluation diagnostics on species presence points which were partitioned from the original dataset.

When you click on the Run button, here’s the software/tutorial.doc description of what happens next…

A progress monitor describes the steps being taken. After the environmental layers are loaded and some initialization is done, progress towards training of the maxent model is shown like this:

The gain is closely related to deviance, a measure of goodness of fit used in generalized additive and generalized linear models. It starts at 0 and increases towards an asymptote during the run. During this process, Maxent is generating a probability distribution over pixels in the grid, starting from the uniform distribution and repeatedly improving the fit to the data. The gain is defined as the average log probability of the presence samples, minus a constant that makes the uniform distribution have zero gain. At the end of the run, the gain indicates how closely the model is concentrated around the presence samples; for example, if the gain is 2, it means that the average likelihood of the presence samples is exp(2) ≈ 7.4 times higher than that of a random background pixel. Note that Maxent isn’t directly calculating “probability of occurrence”. The probability it assigns to each pixel is typically very small, as the values must sum to 1 over all the pixels in the grid.

Once the progress meter disappears, Maxent is finished producing the results. At the global scale, this may take several minutes.

2.1.2 Inspect Maxent Results

Open the summary of results scenario_01/*.html. Note the sections:

  • Analysis of omission/commission

  • Figure: Omission and Predicted Area. The Cumulative threshold on the x-axis refers to the application of a threshold, above which continuously ranging values (0 to 100) are converted to binary, predicting presence and absence below. We can then look at teh fraction of total area predicted (red). So at a 0 threshold all cells are predicted as “present” hence fraction of 1, whereas at 100 no cells are predicted as “present” hence fraction of 0. Along this continuum of thresholds, we can evaluate how many points are missed from the input training (blue) and test (turquoise) points are missed, ie “omitted” versus the predicted random rate (black).

  • Figure: Sensitivity vs. 1 - Specificity. This plot shows the Receiver Operating Curve (ROC) for both training and test data. The area under the ROC curve (AUC) shown in the legend indicates how well the model performs along a range of thresholds (1 is perfect, 0.5 is random; the further to the upper left the blue and red lines to the upper left corner, the better the model).

  • Table: Logistic thresholds. Because we are using the default Output format parameter “Logistic”, ie ranging from 0 to 1, the Logistic threshold would be applied for converting the output from continuous (0 to 1) to binary (present vs absent). Of the many Descriptions available, “Maximum training sensitivity plus specificity” optimizes the tradeoff between sensitivity ( mimizing false presences, ie errors of comission) and specificity (mimizing false absences, ie errors of omission) using the training data. We’ll use this later to convert the continuous probability of encounter to a binary surface (present vs absent) for delineating species habitat “patches”.

  • Pictures of the model. The map here shows the continuous probability of occurrence with square dots showing train (white) and test (purple) locations. (The Explain tool, although nifty to explore, only seems to rarely work. You can try it out by using a new output directory, eg scenario_01-explain, untick Auto running the model again Although you could untick Auto features and Product features which disables predictor interactions. Turning off Product features is required to use the Explain tool, which might then work by double clicking the batch file *_explain.bat.)

  • Response curves Two sets of response curves are generated:

  1. Marginal species response (0 to 1 in y axis) to predictor (range of values in x axis) when all other predictors are averaged. These term plots can be hard to interpret when variables are correlated with each other.

  2. Dedicated species response to predictor for model using only that predictor. You can much more clearly see the “niche” of this species by the given environmental variable.

  • Analysis of variable contributions. The variables in this table are sorted by Percent contribution of the given variable to the overall gain of the fitted model. The Permutation importance describes the fractional difference in AUC score if the given variable is excluded. So Percent contribution describes overall model fit, and Permution importance relates to how much more accurate (ie sensitivity vs specificity) the model is by the variable’s inclusion.

  • Raw data outputs and control parameters Links to raw data file outputs and various statistics are included here.

For more details on Maxent input and output, check out the help and tutorial docs in the software folder.

Question: What are the top 3 predictors for Percent contribution versus top 3 for Permutation importance?

Question: Are there any other “hotspots” outside the extent of occurrence (test and train points), particularly outside the Americas, which represent similar habitats?

2.2 All Predictors to Current Cropped [scenario_02]

Next, we’ll limit the environmental background to the extent of occurrences to create a more refined fitted model.

Use the same settings as scenario_01, except:

  • Environmental layers: browse to env\current_10min_cropped

  • Output directory: browse to scenario_02

Here are the environmental predictors cropped to the extent of occurrence.

plot of chunk sc02_plot_env-current-cropped

Note that these cropped environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by commenting out (ie precede the line with a # character) the plot command in the R chunk above.

After you Run the model, inspect the results scenario_02/*.html.

Question: Now, what are the top 3 predictors for Percent contribution versus top 3 for Permutation importance?

Question: Compare the following dedicated altitude response curves between the global [scenario_01] and cropped [scenario_02] models? Briefly explain why these might be different.

  • global [scenario_01]:

  • cropped [scenario_02]:

2.3 Top Predictors to Current Cropped [scenario_03]

To simplify the prediction and interpretation of the model, choose the combination of “top” predictors that comprise:

  • altitude (alt)

  • top 3 variables by Percent contribution

  • top 3 variables by Permutation importance

Question: What are these “top” predictors?

Use the same Marxan settings as scenario_02, except:

  • Environmental layers: continuing with env\current_10min_cropped, Deselect all and individually tick the “top”" predictors identified above.

  • Output directory: browse to scenario_03

Question: By dropping the other predictor variables, how much predictive performance did we lose? In technical terms, report the difference in Area Under the Curve of the Test data going from scenario_02 to scenario_03.

2.4 Top Predictors to Current Santa Barbara [scenario_04]

Here are the environmental predictors for Santa Barbara County’s current climate.

plot of chunk sc04_plot_env-current-sb

Note that these global environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by comment commenting out (ie precede the line with a # character) the plot command in the R chunk above.

We’ll continue to use the model fitted to the cropped extent of occurrence, but predict to Santa Barbara County at a much higher resolution (0.5 minutes). You can fit a model and predict to a different set of environmental rasters by using the parameter “Projection layers directory/file.” Given that we already fitted the model, we can directly predict to a different set of environmental data with the following command (per the software/tutorial.doc:

java -cp maxent.jar density.Project lambdaFile gridDir outFile

This command gets written to the batch file scenario_04/maxent_predict.bat, similar to software/maxent.bat, by the R chunk below. You’ll need to Knit or Run All chunks in order for this code to generate the scenario_04 file. Then simply double-click on scenario_04/maxent_predict.bat to predict the scenario_03 fitted model to the current Santa Barbara half minute environment env/current_0.5min_sb and output to the scenario_04 folder. You should see a black command window flicker up and the folder populate with 4 raster files (*_current_0.5min_sb*).

Here’s the continuous probability of encounter for Amphispiza belli in Santa Barbara County’s current climate:

plot of chunk sc04_plot_current_sb

2.5 Top Predictors to Future Santa Barbara [scenario_05]

Here are the environmental predictors for Santa Barbara County’s future climate.

plot of chunk sc05_plot_env-future-sb

Note that these global environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by comment commenting out (ie precede the line with a # character) the plot command in the R chunk above.

Similar to the previous scenario, you’ll predict to the future Santa Barbara climate with the automatically generated batch file scenario_05/maxent_predict.bat which will populate the folder with 4 raster files (*_future_0.5min_sb*).

Here’s the continuous probability of encounter for Amphispiza belli in Santa Barbara County’s future climate:

plot of chunk sc05_plot_future_sb

2.5.1 Compare Current and Future Santa Barbara Scenarios

Next, we’ll find the threshold that converts 20% of the landscape into species habitat, so you have a reasonable quantity of patches for conducting the Connectivity lab. Then we’ll apply that same threshold to the future scenario to return the: lost (red; -1), core (blue; 0), and novel (green; 1) habitats.

plot of chunk sc05_future-vs-current_sb

Threshold above which converts 20% of the landscape to habitat for SB current climate: 0.6755.

zone area_km2 pct
-1 1838.0 22.696
0 243.2 3.004
1 6017.2 74.300

Question. Compare the threshold above that obtains 20% of the landscape as habitat with that of the “Maximum training sensitivity plus specificity” found in the “Logistic threshold” table scenario_03 model results. Would using the threshold “Maximum training sensitivity plus specificity” result in more or less habitat for the current climate?

Assignment

For this assignment, you’re are expected to generate all 5 scenario outputs, knit this lab6.Rmd document to have all outputs included, and respond to all the questions above as a seperate dedicated Word document lab6_answers.docx directly in the lab6_species folder (be sure to include the original question). Since the environmental predictor plots will be the same regardless of species chosen, please exclude them from your knitted document by commenting out (ie precede the line with a # character) the plot command from within the following R chunks:

To evaluate your responses, I’d like to see your writeup and all scenario outputs in a zip file that you submit. The easiest way to do this is from Windows Explorer, right-click on the lab6_species folder -> 7-Zip -> Add to lab6_species.zip. Double click on this file to open it in 7-Zip, navigate into the lab6_species folder, select and delete the following subfolders: env, software, figures. After you close 7-Zip, submit this lab6_species.zip file into the GauchoSpace lab6 assignment.

Further Resources

You might be interested in perusing resources on species distribution modeling from week 4 of last quarter’s Advanced GIS course which used a different modeling technique, the generalized linear model (vs Maxent) and pseudo-absence points (vs background).