Due: 5pm Fri, Mar 6 2015 via GauchoSpace

Introduction

In this lab you’ll explore differences between sites to characterize “communities” based on dissimilarity in species composition and environment. Whittaker (1960) was first to describe species diversity between sites (beta; \(\beta\) diversity), which is in relation to diversity of the individual (alpha; \(\alpha\) diversity) and overall region (gamma; \(\gamma\) diversity).

You will quantify how “different” sites are from each other based on differences in species composition. This ecological distance between sites then forms the input matrix for an ordination technique that oders the sites along one or more axis so as to maximize the spread of ecological distances. You’ll then cluster these sites into similar groups.

Next, you’ll look at environmental variation between sites in a few ways. First, you’ll overlay environmental vectors in the “species” ordination space to show general trends. Then you’ll add environmental contour plots for a couple dominant variables to see how the range of environmental values trends across the sites in species space. Next you’ll conduct recursive partitioning to determine how environmental variables predict the groups.

Set Environment

Set your working directory in the R chunk below.

Study Area and Data

You’ll use the tree survey data from Sequoia National Park described by Urban et al (2002) from your assigned readings. The survey files in the data folder are:

You can find a more detailed description in data\sequoia_data.pdf.

Let’s start by mapping the study plots (see also Figure 2 from Urban et al, 2002).

Per the approach of Urban et al (2002), you’ll limit your analysis to the top 11 dominant species by average density (stems per hectare). Inclusion of very rare species can make it difficult for the ordination technique to converge on a solution. The following R chunk calculates the average species density (stems per hectare) and spits out the original species data now limited to the top 11 species into data/spp11.csv.

rank species avg_density
1 SEgi 22.1326263
2 ABco 18.1081818
3 ABma 13.0607071
4 PImo 9.7373737
5 PIla 7.1006061
6 PIje 5.3846465
7 CAde 4.2610101
8 PIco 3.0318182
9 PIpo 2.5250505
10 QUke 1.4663636
11 COnu 0.0505051
12 ARvi 0.0169697
13 CEin 0.0156566
14 QUch 0.0125253
15 UMca 0.0111111
16 ACma 0.0076768
17 TOca 0.0021212

Question: What are the most and least abundant species by common name for the subset of 11 most dominant species to be used in this analysis? (You can find the lookup table to go from Code to Common Name in the data\sequoia_data.pdf.)

1 Calculate Species Dissimilarity between Sites

The first step in this community analysis is to define the ecological distance between sites \(\beta\) (beta diversity) based on species composition.

The best known index of beta diversity is based on the ratio of total number of species in a collection of sites \(S\) (ie \(\gamma\) diversity) and the average richness per one site \(\bar \alpha\):

\[ \beta = S/\bar \alpha - 1 \ \]

Subtraction of one means that \(\beta = 0\) when there are no excess species or no heterogeneity between sites. For this study area we calculate it with the following:

spp17   = read.csv('data/spp17.csv', row.names=1)
alpha17 = mean(specnumber(spp17))
beta17  = ncol(spp17)/alpha17 - 1

spp11   = read.csv('data/spp11.csv', row.names=1)
alpha11 = mean(specnumber(spp11))
beta11  = ncol(spp11)/alpha11 - 1

We find the values are different for all 17 species (\(\alpha_{17}\) = 2.8484848; \(\beta_{17}\) = 4.9680851) than for the top 11 dominant species (\(\alpha_{11}\) = 2.7171717; \(\beta_{11}\) = 3.0483271).

Question: In simple terms, why the difference in \(\beta\) and \(\alpha\) terms between the dataset with all 17 species versus the most dominant 11?

This simple calculation is problematic though since \(S\) increases with the number of sites even when sites are all subsets of the same community. Whittaker (1960) noticed this, and suggested the index to be found from pairwise comparison of sites. The most commonly used metric for this is the Bray-Curtis dissimilarity (also known as the Sorenson index of dissimilarity):

\[ \beta = \frac{a+b+c}{(2a+b+c)/2} - 1 = \frac{b+c}{2a+b+c} \]

where the number of shared species in two sites is \(a\), and the numbers of species unique to each site are \(b\) and \(c\), then \(\bar \alpha = (2a + b + c)/2\) and \(S = a + b + c\). The Bray-Curtis dissimilarity is bound between 0 and 1, where 0 means the two sites have the same composition (ie they share all the same species), and 1 means the two sites do not share any species. This metric can be calculated between all sites using the vegan function vegdist:

# Bray-Curtis as binary (presence/absence) of species per equation above
spp11_bray_binary      = vegdist(spp11, method='bray', binary=T)

# Bray-Curtis as amount (density) of species
spp11_bray             = vegdist(spp11, method='bray')

# transformed Bray-Curtis as amount (density) of species
spp11_bray_transformed = vegdist(wisconsin(sqrt(spp11)), method='bray')

# write to data folder
write.csv(as.matrix(spp11_bray_binary),      'data/spp11_bray_binary.csv')
write.csv(as.matrix(spp11_bray),             'data/spp11_bray.csv')
write.csv(as.matrix(spp11_bray_transformed), 'data/spp11_bray_transformed.csv')

Note that in the R chunk above you calculated Bray Curtis using the binary (presence/absence) of species as well as using the density. You generally want to transform the input species composition data so that the inputs are normalized between the species and sites. All the ecological distance measures are output to the data folder.

Let’s look at how these values are distributed.

# show just the lower triangle of the matrix
i = lower.tri(as.matrix(spp11_bray_binary))
hist(as.matrix(spp11_bray_binary)[i]     , 
     xlab='ecological distance', main='spp11_bray_binary')

hist(as.matrix(spp11_bray)[i]            ,  
     xlab='ecological distance', main='spp11_bray')

hist(as.matrix(spp11_bray_transformed)[i],  
     xlab='ecological distance', main='spp11_bray_transformed')

In order to solidify subsequent processes in your mind, let’s look values of a few individual sites that are closest, furthest and some random pair in between. When you crack open data/spp11_bray_transformed.csv you’ll see 0’s along the diagonals corresponding to zero distance from a site to itself. Let’s not include those as potential closest sites. Also, you’ll notice not just a single max value (1), but many.

m = as.matrix(spp11_bray_transformed)
diag(m) = NA
i_max = which(m == max(m, na.rm=T), arr.ind=T)
# ... TO BE CONTINUED

Question: What are the dimensions [# rows X # columns] of the output ecological distance matrices in terms of sites/species/or…?

Question: Lookup the help for the vegdist function and name at least three other ecological distance measures.

Question: In the R chunk above, how were the species composition data transformed to create spp11_bray_transformed before feeding into the vegdist function? Lookup the functions used to describe this transformation.

2 Ordinate on Species Dissimilarity

In multivariate analysis, ordination (eg gradient analysis) orders objects that are characterized by values on multiple variables (i.e., multivariate objects) so that similar objects are near each other and dissimilar objects are farther from each other. These relationships between the objects, on one or more axes, are then characterized numerically and/or graphically. Here’s a dichotomus key for choosing the appropriate ordination technique from Dean Urban’s Multivariate class:

Question Choice
1a. Ancillary data (ENV) not available, or not used to constrain the ordination 2
1b. Ancillary data (ENV) available and used in ordination 4
2a. Species response assumed linear PCA (FA)
2b. Species response not linear, or unknown 3
3a. Species response assumed nonlinear and unimodal DCA (RA)
3b. Species response not linear, nor nonlinear and unimodal, or unknown NMS (PCoA, BC)
4a. Species response assumed linear RDA
4b. Species response not linear, or unknown 5
5a. Species response nonlinear and unimodal 6
5b. Species response not linear, nor nonlinear and unimodal, or unknown dbRDA
6a. A single (or few) environmental factors known (or required) as constraints DO
6b. Several factors suspected CCA

In the key, the lead technique is the best or default choice; alternatives in parentheses are options appropriate to particular data sets (see Applications Checklist, below). The techniques:

Since nonmetric multidimensional scaling has the fewest assumptions of the data, we’ll proceed with that technique.

spp11_mds = metaMDS(spp11, k=3, trace=F, trymax=100)
spp11_mds
## 
## Call:
## metaMDS(comm = spp11, k = 3, trymax = 100, trace = F) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     wisconsin(sqrt(spp11)) 
## Distance: bray 
## 
## Dimensions: 3 
## Stress:     0.07271247 
## Stress type 1, weak ties
## Two convergent solutions found after 97 tries
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'wisconsin(sqrt(spp11))'

Question Note in the R chunk above that the data input for the metaMDS function is the original species composition data sp11, but how did the argument k=3 and default values for arguments distance and autotransform affect the results?

Let’s look at the first 2 axis of ordination with site numbers in black and species in red.

plot(spp11_mds, type='t')

There’s a third axis which you can see in this plot.

ordiplot3d(spp11_mds, type='h')