Introduction
Set Environment
Study Area and Data
1 Calculate Species Dissimilarity between Sites
2 Ordinate on Species Dissimilarity
3 Cluster on Species Dissimilarity
4 Fit Environment into Ordination Space
5 Use Environment to Predict Cluster Groups
6 Other Diversity Indices
7 Species Accumulation Curve
Assignment
Further Resources

Due: 5pm Fri, Mar 6 2015 via GauchoSpace

Introduction

In this lab you’ll explore differences between sites to characterize “communities” based on dissimilarity in species composition and environment. Whittaker (1960) was first to describe species diversity between sites (beta; \(\beta\) diversity), which is in relation to diversity of the individual (alpha; \(\alpha\) diversity) and overall region (gamma; \(\gamma\) diversity).

You will quantify how “different” sites are from each other based on differences in species composition. This ecological distance between sites then forms the input matrix for an ordination technique that oders the sites along one or more axis so as to maximize the spread of ecological distances. You’ll then cluster these sites into similar groups.

Next, you’ll look at environmental variation between sites in a few ways. First, you’ll overlay environmental vectors in the “species” ordination space to show general trends. Then you’ll add environmental contour plots for a couple dominant variables to see how the range of environmental values trends across the sites in species space. Next you’ll conduct recursive partitioning to determine how environmental variables predict the groups.

Set Environment

Set your working directory in the R chunk below.

Study Area and Data

You’ll use the tree survey data from Sequoia National Park described by Urban et al (2002) from your assigned readings. The survey files in the data folder are:

env21.csv: environmental variables measured over 99 samples plots [99 sites X 21 environment variables]
spp17.csv: 17 tree species measured over the 99 sample plots as density (tree stems per hectare) [99 sites X 17 species]
plotxy.csv: the plots with their UTM coordinates (UTM zone 11N) [99 sites X 2 coordinates]

You can find a more detailed description in data\sequoia_data.pdf.

Let’s start by mapping the study plots (see also Figure 2 from Urban et al, 2002).

Per the approach of Urban et al (2002), you’ll limit your analysis to the top 11 dominant species by average density (stems per hectare). Inclusion of very rare species can make it difficult for the ordination technique to converge on a solution. The following R chunk calculates the average species density (stems per hectare) and spits out the original species data now limited to the top 11 species into data/spp11.csv.

rank	species	avg_density
1	SEgi	22.1326263
2	ABco	18.1081818
3	ABma	13.0607071
4	PImo	9.7373737
5	PIla	7.1006061
6	PIje	5.3846465
7	CAde	4.2610101
8	PIco	3.0318182
9	PIpo	2.5250505
10	QUke	1.4663636
11	COnu	0.0505051
12	ARvi	0.0169697
13	CEin	0.0156566
14	QUch	0.0125253
15	UMca	0.0111111
16	ACma	0.0076768
17	TOca	0.0021212

Question: What are the most and least abundant species by common name for the subset of 11 most dominant species to be used in this analysis? (You can find the lookup table to go from Code to Common Name in the data\sequoia_data.pdf.)

1 Calculate Species Dissimilarity between Sites

The first step in this community analysis is to define the ecological distance between sites \(\beta\) (beta diversity) based on species composition.

The best known index of beta diversity is based on the ratio of total number of species in a collection of sites \(S\) (ie \(\gamma\) diversity) and the average richness per one site \(\bar \alpha\):

\[ \beta = S/\bar \alpha - 1 \ \]

Subtraction of one means that \(\beta = 0\) when there are no excess species or no heterogeneity between sites. For this study area we calculate it with the following:

spp17   = read.csv('data/spp17.csv', row.names=1)
alpha17 = mean(specnumber(spp17))
beta17  = ncol(spp17)/alpha17 - 1

spp11   = read.csv('data/spp11.csv', row.names=1)
alpha11 = mean(specnumber(spp11))
beta11  = ncol(spp11)/alpha11 - 1

We find the values are different for all 17 species (\(\alpha_{17}\) = 2.8484848; \(\beta_{17}\) = 4.9680851) than for the top 11 dominant species (\(\alpha_{11}\) = 2.7171717; \(\beta_{11}\) = 3.0483271).

Question: In simple terms, why the difference in \(\beta\) and \(\alpha\) terms between the dataset with all 17 species versus the most dominant 11?

This simple calculation is problematic though since \(S\) increases with the number of sites even when sites are all subsets of the same community. Whittaker (1960) noticed this, and suggested the index to be found from pairwise comparison of sites. The most commonly used metric for this is the Bray-Curtis dissimilarity (also known as the Sorenson index of dissimilarity):

\[ \beta = \frac{a+b+c}{(2a+b+c)/2} - 1 = \frac{b+c}{2a+b+c} \]

where the number of shared species in two sites is \(a\), and the numbers of species unique to each site are \(b\) and \(c\), then \(\bar \alpha = (2a + b + c)/2\) and \(S = a + b + c\). The Bray-Curtis dissimilarity is bound between 0 and 1, where 0 means the two sites have the same composition (ie they share all the same species), and 1 means the two sites do not share any species. This metric can be calculated between all sites using the vegan function vegdist:

# Bray-Curtis as binary (presence/absence) of species per equation above
spp11_bray_binary      = vegdist(spp11, method='bray', binary=T)

# Bray-Curtis as amount (density) of species
spp11_bray             = vegdist(spp11, method='bray')

# transformed Bray-Curtis as amount (density) of species
spp11_bray_transformed = vegdist(wisconsin(sqrt(spp11)), method='bray')

# write to data folder
write.csv(as.matrix(spp11_bray_binary),      'data/spp11_bray_binary.csv')
write.csv(as.matrix(spp11_bray),             'data/spp11_bray.csv')
write.csv(as.matrix(spp11_bray_transformed), 'data/spp11_bray_transformed.csv')

Note that in the R chunk above you calculated Bray Curtis using the binary (presence/absence) of species as well as using the density. You generally want to transform the input species composition data so that the inputs are normalized between the species and sites. All the ecological distance measures are output to the data folder.

Let’s look at how these values are distributed.

# show just the lower triangle of the matrix
i = lower.tri(as.matrix(spp11_bray_binary))
hist(as.matrix(spp11_bray_binary)[i]     , 
     xlab='ecological distance', main='spp11_bray_binary')

hist(as.matrix(spp11_bray)[i]            ,  
     xlab='ecological distance', main='spp11_bray')

hist(as.matrix(spp11_bray_transformed)[i],  
     xlab='ecological distance', main='spp11_bray_transformed')

In order to solidify subsequent processes in your mind, let’s look values of a few individual sites that are closest, furthest and some random pair in between. When you crack open data/spp11_bray_transformed.csv you’ll see 0’s along the diagonals corresponding to zero distance from a site to itself. Let’s not include those as potential closest sites. Also, you’ll notice not just a single max value (1), but many.

m = as.matrix(spp11_bray_transformed)
diag(m) = NA
i_max = which(m == max(m, na.rm=T), arr.ind=T)
# ... TO BE CONTINUED

Question: What are the dimensions [# rows X # columns] of the output ecological distance matrices in terms of sites/species/or…?

Question: Lookup the help for the vegdist function and name at least three other ecological distance measures.

Question: In the R chunk above, how were the species composition data transformed to create spp11_bray_transformed before feeding into the vegdist function? Lookup the functions used to describe this transformation.

2 Ordinate on Species Dissimilarity

In multivariate analysis, ordination (eg gradient analysis) orders objects that are characterized by values on multiple variables (i.e., multivariate objects) so that similar objects are near each other and dissimilar objects are farther from each other. These relationships between the objects, on one or more axes, are then characterized numerically and/or graphically. Here’s a dichotomus key for choosing the appropriate ordination technique from Dean Urban’s Multivariate class:

Question	Choice
1a. Ancillary data (ENV) not available, or not used to constrain the ordination	2
1b. Ancillary data (ENV) available and used in ordination	4
2a. Species response assumed linear	PCA (FA)
2b. Species response not linear, or unknown	3
3a. Species response assumed nonlinear and unimodal	DCA (RA)
3b. Species response not linear, nor nonlinear and unimodal, or unknown	NMS (PCoA, BC)
4a. Species response assumed linear	RDA
4b. Species response not linear, or unknown	5
5a. Species response nonlinear and unimodal	6
5b. Species response not linear, nor nonlinear and unimodal, or unknown	dbRDA
6a. A single (or few) environmental factors known (or required) as constraints	DO
6b. Several factors suspected	CCA

In the key, the lead technique is the best or default choice; alternatives in parentheses are options appropriate to particular data sets (see Applications Checklist, below). The techniques:

PCA: Principal components analysis
FA: Factor analysis
DCA: Detrended correspondence analysis
RA: Reciprocal averaging
NMS: nonmetric multidimensional scaling
PCoA: Principal coordinates analysis
BC: Bray-Curtis (Polar) ordination
RDA: Redundancy analysis
CCA: Canonical correspondence analysis
dbRDA: distance-based RDA
DO: Direct ordination (includes weighted averaging)

Since nonmetric multidimensional scaling has the fewest assumptions of the data, we’ll proceed with that technique.

spp11_mds = metaMDS(spp11, k=3, trace=F, trymax=100)
spp11_mds

## 
## Call:
## metaMDS(comm = spp11, k = 3, trymax = 100, trace = F) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     wisconsin(sqrt(spp11)) 
## Distance: bray 
## 
## Dimensions: 3 
## Stress:     0.07271247 
## Stress type 1, weak ties
## Two convergent solutions found after 97 tries
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'wisconsin(sqrt(spp11))'

Question Note in the R chunk above that the data input for the metaMDS function is the original species composition data sp11, but how did the argument k=3 and default values for arguments distance and autotransform affect the results?

Let’s look at the first 2 axis of ordination with site numbers in black and species in red.

plot(spp11_mds, type='t')

There’s a third axis which you can see in this plot.

ordiplot3d(spp11_mds, type='h')

You can also interact with a 3D plot using the following commands in the Console which opens up a new window (that you need to find in the Windows bar). Click hold and rotate the axes. First, by sites:

if (interactive()){
  ordirgl(spp11_mds, size=2) # default sites
}

or by species:

if (interactive()){
  ordirgl(spp11_mds, display='species', type='t') # species
}

3 Cluster on Species Dissimilarity

Now hierarchically cluster the sites using the same transformed Bray-Curtis species dissimilarity input as the ordination.

# hierarchically cluster sites
clu = hclust(spp11_bray_transformed, 'average')
plot(clu)

Now cutoff the hierarchical cluster into three groups.

# cutoff clusters to 3 groups
plot(clu)
rect.hclust(clu, k=3) # or by height of similarity h=0.5

grp = cutree(clu, k=3)

Plot the groups in ordination space.

# plot convex hull of groups: overlap
plot(spp11_mds, display='sites')
ordihull(spp11_mds, grp, lty=2, col='red', label=T)

Which you can also do interactively in 3D.

if (interactive()){
  # plot 3D to see 3rd NMS axis
  ordirgl(spp11_mds, display='sites') #, type='t')
  orglspider(spp11_mds, grp)
}

4 Fit Environment into Ordination Space

Next, you’ll plot the environment vectors fitted into the “species” space.

env21 = read.csv('data/env21.csv', row.names=1)
ef = envfit(spp11_mds, env21, permu = 999)
ef

## 
## ***VECTORS
## 
##            NMDS1    NMDS2     r2 Pr(>r)    
## Elev     0.99995 -0.01038 0.8823  0.001 ***
## Slope   -0.62043  0.78426 0.0530  0.090 .  
## Tasp     0.71012 -0.70408 0.0053  0.753    
## TSI      0.84736 -0.53101 0.0138  0.515    
## xLitter -0.49325  0.86989 0.0503  0.068 .  
## xDepth  -0.60080  0.79940 0.4012  0.001 ***
## sDepth   0.72877  0.68476 0.0089  0.668    
## pH      -0.55792  0.82989 0.5088  0.001 ***
## C        0.58942 -0.80783 0.1307  0.001 ***
## N        0.33071 -0.94373 0.0704  0.026 *  
## CN       0.99981 -0.01968 0.2649  0.001 ***
## P       -0.09770  0.99522 0.0556  0.066 .  
## Ca      -0.86190  0.50709 0.2417  0.001 ***
## Mg      -0.99148 -0.13023 0.2255  0.001 ***
## K       -0.93975  0.34186 0.2126  0.001 ***
## Acid     0.76256 -0.64691 0.5098  0.001 ***
## ECEC    -0.93377  0.35787 0.1705  0.001 ***
## BS      -0.85626  0.51654 0.5999  0.001 ***
## Clay    -0.35055 -0.93654 0.0267  0.259    
## Silt    -0.25389 -0.96723 0.0030  0.863    
## Sand     0.26003  0.96560 0.0048  0.785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

plot(spp11_mds, display = 'sites')
plot(ef, p.max = 0.1)

# TODO: highlight special sites and extract env values

Now choose two environment variables to create contours.

# pick two variables: pH and Elev
ef2 = envfit(spp11_mds ~ pH + Elev, env21)
plot(spp11_mds, display = 'sites')
plot(ef2)
tmp = with(env21, ordisurf(spp11_mds, pH, add=T))
with(env21, ordisurf(spp11_mds, Elev, add=T, col="green4"))

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## <environment: 0x7fd8533f6120>
## 
## Estimated degrees of freedom:
## 8.1  total = 9.1 
## 
## REML score: 620.2007

Get the difference in environmental values between groups (from hierarchical clustering above).

boxplot(pH ~ grp, data=env21, notch = TRUE, xlab='group', ylab='pH')

## Warning in bxp(structure(list(stats = structure(c(4.92, 5.24, 5.45, 5.73,
## : some notches went outside hinges ('box'): maybe set notch=FALSE

boxplot(Elev ~ grp, data=env21, notch = TRUE, xlab='group', ylab='Elev')

## Warning in bxp(structure(list(stats = structure(c(1545, 1673, 1923, 2093,
## : some notches went outside hinges ('box'): maybe set notch=FALSE

5 Use Environment to Predict Cluster Groups

Now you’ll use recursive partitioning to see how the environmental data can be used to predict the different groups based on site species dissimmilarity.

tree1 = rpart(factor(grp) ~ ., data=env21)
tree1_p = as.party(tree1)
tree1_p

## 
## Model formula:
## factor(grp) ~ Elev + Slope + Tasp + TSI + xLitter + xDepth + 
##     sDepth + pH + C + N + CN + P + Ca + Mg + K + Acid + ECEC + 
##     BS + Clay + Silt + Sand
## 
## Fitted party:
## [1] root
## |   [2] Elev < 2395
## |   |   [3] pH >= 4.845
## |   |   |   [4] Elev < 2162.5: 1 (n = 54, err = 0.0%)
## |   |   |   [5] Elev >= 2162.5: 3 (n = 8, err = 50.0%)
## |   |   [6] pH < 4.845: 3 (n = 11, err = 0.0%)
## |   [7] Elev >= 2395: 2 (n = 26, err = 0.0%)
## 
## Number of inner nodes:    3
## Number of terminal nodes: 4

plot(tree1_p)

6 Other Diversity Indices

Let’s plot some other diversity indices for each site and see how they relate to each other.

# diversity indices for 17 species
Shannon17 = diversity(spp17, index='shannon')
Simpson17 = diversity(spp17, index='simpson')
InvSimpson17 = diversity(spp17, 'inv')
Rare17 = rarefy(round(spp17), 2) - 1
Alpha17 = fisher.alpha(round(spp17))
pairs(cbind(Shannon17, Simpson17, InvSimpson17, Rare17, Alpha17), pch='+', col='blue')

# diversity indices for 17 species
Shannon11 = diversity(spp11, index='shannon')
Simpson11 = diversity(spp11, index='simpson')
InvSimpson11 = diversity(spp11, 'inv')
Rare11 = rarefy(round(spp11), 2) - 1
Alpha11 = fisher.alpha(round(spp11))
pairs(cbind(Shannon11, Simpson11, InvSimpson11, Rare11, Alpha11), pch='+', col='blue')

7 Species Accumulation Curve

Just for kicks and giggles, let’s look at species accumulation curve, or rarefaction curve, which shows us the number of species we expect to find as we add plots.

# species accumulation curves
sac <- specaccum(spp17)
plot(sac, ci.type='polygon', ci.col='gray')

Assignment

Besides the questions above, your assignment is to describe the species and environment associations between sites. Please reference species by “common name (code)”, such as “giant sequoia (SEgi).” Be explicit about the relationship between species and at least 3 environmental gradients as well as the 3 groups identified by the cluster analysis. You can simply refer to which sites are found in relatively high versus mid and low range environmental values. Treat this like a Results and Discussion section of a paper describing the community analysis of this Sequoia National Park study area. You might also wish to recall Figure 1.4 from the textbook (slide 18 of first lecture) by Whittaker (1956) showing the different species communities and environmental gradients of the Great Smoky Mountains.

To help you identify these differences choose Chunks -> Run All which should prompt you to click on sites (open circles). Click on sites at the environmental extremes and near the group’s center. Click ESC to exit the interactive selection of sites.

Once you’ve identified the sites, knit this document to generate the following 3 figures and 2 tables which you should include and reference by caption number in your Word document writeup. Any of the other results above you can similarly include, such as the recursive partitioning plot under “Use Environment to Predict Cluster Groups” by adding the figure, a figure number and caption.

Figure 1. Sites (circles), including selected sites (closed circles with numeric id), enclosed by group membership convex hull (red dashed line; red numeric label), associated species ordination centers (red character text), and environmental gradients (blue vector lines).

Table 1. Species density (stems per hectare) original (non-normalized) values per selected site.

	ABco	ABma	CAde	PIco	PIje	PIla	PImo	PIpo	QUke	SEgi
35	43.97	0.00	9.03	0.00	0.00	6.35	0.00	8.65	0.00	0.00
37	0.55	0.00	2.87	0.00	0.00	4.27	0.00	1.66	27.74	0.00
46	0.00	34.42	0.00	0.00	0.00	0.00	21.27	0.00	0.00	0.00
64	5.66	0.00	0.00	0.00	0.00	129.56	0.00	0.00	0.00	179.55
77	0.00	0.00	0.00	29.76	0.00	0.00	0.00	0.00	0.00	0.00
93	6.76	0.00	0.00	0.00	61.25	0.00	0.00	0.00	0.00	0.00
96	0.00	0.00	0.00	0.00	24.99	0.00	0.00	0.00	0.00	0.00

Table 2. Environmental original (non-normalized) values per selected site.

	Elev	Slope	Tasp	TSI	xLitter	xDepth	sDepth	pH	C	N	CN	P	Ca	Mg	K	Acid	ECEC	BS	Clay	Silt	Sand
35	1802	24	-0.15643	0.01871	6.3	96.52	11.88	5.56	3.450	0.130	26.09	123.23	7.02	0.817	0.457	0.077	8.37	99.07	0.6	19.1	80.3
37	1765	30	-0.84805	0.07265	4.9	27.27	33.07	5.35	3.073	0.160	19.21	145.95	5.78	0.677	0.347	0.123	6.92	97.94	0.2	18.2	81.6
46	2687	17	-0.22495	0.06968	2.7	54.60	36.30	4.15	6.373	0.200	34.20	76.35	2.01	0.267	0.397	2.327	5.00	54.78	1.6	29.0	69.3
64	2125	13	-0.70711	0.00439	9.7	91.13	23.14	6.37	1.607	0.067	23.83	27.57	6.67	0.403	0.450	-0.003	7.53	100.00	0.2	20.6	79.2
77	2822	5	0.60181	0.00436	2.7	17.30	10.26	4.27	2.177	0.090	24.40	238.43	0.32	0.067	0.130	1.060	1.57	33.27	0.0	20.5	79.6
93	2218	4	-0.19082	-0.01310	2.7	41.30	18.64	5.16	1.753	0.077	23.19	89.77	2.86	0.377	0.357	0.097	3.70	96.87	0.0	19.0	81.3
96	2017	10	0.97437	0.04057	1.3	4.07	4.87	3.90	6.330	0.283	22.27	81.60	1.56	0.237	0.200	2.337	4.33	46.60	2.1	22.3	75.6

Figure 2. Normalized species densities with with selected sites shown in rank order (blue text).

Figure 3. Normalized environmental densities with selected sites shown in rank order (blue text).

Further Resources

The data set and overall concepts were pulled from Dean Urban’s Multivariate Analysis class at Duke University. The specific R commands have been updated to use the Vegan R package as described by these instructive vignettes:

Other helpful lecture materials can be found here:

Lab 8. Communities

Ben Best

Feb 26, 2015