- Introduction
- Set Environment
- Study Area and Data
- 1 Calculate Species Dissimilarity between Sites
- 2 Ordinate on Species Dissimilarity
- 3 Cluster on Species Dissimilarity
- 4 Fit Environment into Ordination Space
- 5 Use Environment to Predict Cluster Groups
- 6 Other Diversity Indices
- 7 Species Accumulation Curve
- Assignment
- Further Resources

Due: 5pm Fri, Mar 6 2015 via GauchoSpace

In this lab you’ll explore differences between sites to characterize “communities” based on dissimilarity in species composition and environment. Whittaker (1960) was first to describe species diversity between sites (*beta*; \(\beta\) diversity), which is in relation to diversity of the individual (*alpha*; \(\alpha\) diversity) and overall region (*gamma*; \(\gamma\) diversity).

You will quantify how “different” sites are from each other based on differences in species composition. This *ecological distance* between sites then forms the input matrix for an *ordination* technique that oders the sites along one or more axis so as to maximize the spread of ecological distances. You’ll then *cluster* these sites into similar groups.

Next, you’ll look at environmental variation between sites in a few ways. First, you’ll overlay *environmental vectors* in the “species” ordination space to show general trends. Then you’ll add *environmental contour* plots for a couple dominant variables to see how the range of environmental values trends across the sites in species space. Next you’ll conduct *recursive partitioning* to determine how environmental variables predict the groups.

Set your working directory in the R chunk below.

You’ll use the tree survey data from Sequoia National Park described by Urban et al (2002) from your assigned readings. The survey files in the `data`

folder are:

**env21.csv**: environmental variables measured over 99 samples plots [99 sites X 21 environment variables]**spp17.csv**: 17 tree species measured over the 99 sample plots as density (tree stems per hectare) [99 sites X 17 species]**plotxy.csv**: the plots with their UTM coordinates (UTM zone 11N) [99 sites X 2 coordinates]

You can find a more detailed description in `data\sequoia_data.pdf`

.

Let’s start by mapping the study plots (see also Figure 2 from Urban et al, 2002).

Per the approach of Urban et al (2002), you’ll limit your analysis to the top 11 dominant species by average density (stems per hectare). Inclusion of very rare species can make it difficult for the ordination technique to converge on a solution. The following R chunk calculates the average species density (stems per hectare) and spits out the original species data now limited to the top 11 species into `data/spp11.csv`

.

rank | species | avg_density |
---|---|---|

1 | SEgi | 22.1326263 |

2 | ABco | 18.1081818 |

3 | ABma | 13.0607071 |

4 | PImo | 9.7373737 |

5 | PIla | 7.1006061 |

6 | PIje | 5.3846465 |

7 | CAde | 4.2610101 |

8 | PIco | 3.0318182 |

9 | PIpo | 2.5250505 |

10 | QUke | 1.4663636 |

11 | COnu | 0.0505051 |

12 | ARvi | 0.0169697 |

13 | CEin | 0.0156566 |

14 | QUch | 0.0125253 |

15 | UMca | 0.0111111 |

16 | ACma | 0.0076768 |

17 | TOca | 0.0021212 |

**Question**: What are the most and least abundant species by common name for the subset of 11 most dominant species to be used in this analysis? (*You can find the lookup table to go from Code to Common Name in the data\sequoia_data.pdf.*)

The first step in this community analysis is to define the *ecological distance* between sites \(\beta\) (beta diversity) based on species composition.

The best known index of beta diversity is based on the ratio of total number of species in a collection of sites \(S\) (ie \(\gamma\) diversity) and the average richness per one site \(\bar \alpha\):

\[ \beta = S/\bar \alpha - 1 \ \]

Subtraction of one means that \(\beta = 0\) when there are no excess species or no heterogeneity between sites. For this study area we calculate it with the following:

```
spp17 = read.csv('data/spp17.csv', row.names=1)
alpha17 = mean(specnumber(spp17))
beta17 = ncol(spp17)/alpha17 - 1
spp11 = read.csv('data/spp11.csv', row.names=1)
alpha11 = mean(specnumber(spp11))
beta11 = ncol(spp11)/alpha11 - 1
```

We find the values are different for all 17 species (\(\alpha_{17}\) = 2.8484848; \(\beta_{17}\) = 4.9680851) than for the top 11 dominant species (\(\alpha_{11}\) = 2.7171717; \(\beta_{11}\) = 3.0483271).

**Question**: In simple terms, why the difference in \(\beta\) and \(\alpha\) terms between the dataset with all 17 species versus the most dominant 11?

This simple calculation is problematic though since \(S\) increases with the number of sites even when sites are all subsets of the same community. Whittaker (1960) noticed this, and suggested the index to be found from pairwise comparison of sites. The most commonly used metric for this is the Bray-Curtis dissimilarity (also known as the Sorenson index of dissimilarity):

\[ \beta = \frac{a+b+c}{(2a+b+c)/2} - 1 = \frac{b+c}{2a+b+c} \]

where the number of shared species in two sites is \(a\), and the numbers of species unique to each site are \(b\) and \(c\), then \(\bar \alpha = (2a + b + c)/2\) and \(S = a + b + c\). The Bray-Curtis dissimilarity is bound between 0 and 1, where 0 means the two sites have the same composition (ie they share all the same species), and 1 means the two sites do not share any species. This metric can be calculated between all sites using the vegan function `vegdist`

:

```
# Bray-Curtis as binary (presence/absence) of species per equation above
spp11_bray_binary = vegdist(spp11, method='bray', binary=T)
# Bray-Curtis as amount (density) of species
spp11_bray = vegdist(spp11, method='bray')
# transformed Bray-Curtis as amount (density) of species
spp11_bray_transformed = vegdist(wisconsin(sqrt(spp11)), method='bray')
# write to data folder
write.csv(as.matrix(spp11_bray_binary), 'data/spp11_bray_binary.csv')
write.csv(as.matrix(spp11_bray), 'data/spp11_bray.csv')
write.csv(as.matrix(spp11_bray_transformed), 'data/spp11_bray_transformed.csv')
```

Note that in the R chunk above you calculated Bray Curtis using the binary (presence/absence) of species as well as using the density. You generally want to transform the input species composition data so that the inputs are normalized between the species and sites. All the ecological distance measures are output to the data folder.

Let’s look at how these values are distributed.

```
# show just the lower triangle of the matrix
i = lower.tri(as.matrix(spp11_bray_binary))
hist(as.matrix(spp11_bray_binary)[i] ,
xlab='ecological distance', main='spp11_bray_binary')
```

```
hist(as.matrix(spp11_bray)[i] ,
xlab='ecological distance', main='spp11_bray')
```

```
hist(as.matrix(spp11_bray_transformed)[i],
xlab='ecological distance', main='spp11_bray_transformed')
```

In order to solidify subsequent processes in your mind, let’s look values of a few individual sites that are closest, furthest and some random pair in between. When you crack open `data/spp11_bray_transformed.csv`

you’ll see 0’s along the diagonals corresponding to zero distance from a site to itself. Let’s not include those as potential closest sites. Also, you’ll notice not just a single max value (1), but many.

```
m = as.matrix(spp11_bray_transformed)
diag(m) = NA
i_max = which(m == max(m, na.rm=T), arr.ind=T)
# ... TO BE CONTINUED
```

**Question**: What are the dimensions [# rows X # columns] of the output ecological distance matrices in terms of sites/species/or…?

**Question**: Lookup the help for the `vegdist`

function and name at least three other ecological distance measures.

**Question**: In the R chunk above, how were the species composition data transformed to create `spp11_bray_transformed`

before feeding into the `vegdist`

function? *Lookup the functions used to describe this transformation.*

In multivariate analysis, *ordination* (eg *gradient analysis*) orders objects that are characterized by values on multiple variables (i.e., multivariate objects) so that similar objects are near each other and dissimilar objects are farther from each other. These relationships between the objects, on one or more axes, are then characterized numerically and/or graphically. Here’s a dichotomus key for choosing the appropriate ordination technique from Dean Urban’s Multivariate class:

Question | Choice |
---|---|

1a. Ancillary data (ENV) not available, or not used to constrain the ordination | 2 |

1b. Ancillary data (ENV) available and used in ordination | 4 |

2a. Species response assumed linear | PCA (FA) |

2b. Species response not linear, or unknown | 3 |

3a. Species response assumed nonlinear and unimodal | DCA (RA) |

3b. Species response not linear, nor nonlinear and unimodal, or unknown | NMS (PCoA, BC) |

4a. Species response assumed linear | RDA |

4b. Species response not linear, or unknown | 5 |

5a. Species response nonlinear and unimodal | 6 |

5b. Species response not linear, nor nonlinear and unimodal, or unknown | dbRDA |

6a. A single (or few) environmental factors known (or required) as constraints | DO |

6b. Several factors suspected | CCA |

In the key, the lead technique is the best or default choice; alternatives in parentheses are options appropriate to particular data sets (see Applications Checklist, below). The techniques:

- PCA: Principal components analysis
- FA: Factor analysis
- DCA: Detrended correspondence analysis
- RA: Reciprocal averaging
- NMS: nonmetric multidimensional scaling
- PCoA: Principal coordinates analysis
- BC: Bray-Curtis (Polar) ordination
- RDA: Redundancy analysis
- CCA: Canonical correspondence analysis
- dbRDA: distance-based RDA
- DO: Direct ordination (includes weighted averaging)

Since nonmetric multidimensional scaling has the fewest assumptions of the data, we’ll proceed with that technique.

```
spp11_mds = metaMDS(spp11, k=3, trace=F, trymax=100)
spp11_mds
```

```
##
## Call:
## metaMDS(comm = spp11, k = 3, trymax = 100, trace = F)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(spp11))
## Distance: bray
##
## Dimensions: 3
## Stress: 0.07271247
## Stress type 1, weak ties
## Two convergent solutions found after 97 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(spp11))'
```

**Question** Note in the R chunk above that the data input for the `metaMDS`

function is the original species composition data `sp11`

, but how did the argument `k=3`

and default values for arguments `distance`

and `autotransform`

affect the results?

Let’s look at the first 2 axis of ordination with site numbers in black and species in red.

`plot(spp11_mds, type='t')`

There’s a third axis which you can see in this plot.

`ordiplot3d(spp11_mds, type='h')`