Lab 7 - Biodiversity Metrics

Authors

Charles Lehnen

Melissa Guzman

Cameron Egan

Biodiversity

There are many ways to measure biodiversity and to compare these metrics between populations or communities. The simplest two are often considered to be richness and abundance.

Richness

Do you remember the difference between richness and abundance?
  • Richness is the number of species in a community while abundance is the number of individuals from each species in that community.
Do you remember the three ways we have talked about measuring diversity this semester?
  • Alpha, beta, and gamma diversity

Gamma Diversity

It’s time to collect data along a transect! Create a dataframe for this dataset across the five vegetation zones which follow an elevation gradient:

# Directly creating the long format dataframe

df <- data.frame(
  
  plot_id = c(1, 1, 1, 1, 1,
              2, 2, 2, 2, 2,
              3, 3, 3, 3, 3,
              4, 4, 4, 4, 4,
              5, 5, 5, 5, 5),
  
  species_id = c("triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
                 "triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
                 "triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
                 "triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
                 "triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp"),
  
  count = c(2, 2, 2, 2, 2,
            6, 1, 1, 1, 1,
            4, 1, 2, 2, 2,
            4, 2, 0, 4, 0,
            0, 5, 3, 0, 2)
)
head(df)
  plot_id  species_id count
1       1 triangle_sp     2
2       1   square_sp     2
3       1    cross_sp     2
4       1   circle_sp     2
5       1        x_sp     2
6       2 triangle_sp     6

What is the species richness of our region?

# There are various ways to accomplish this. Here are a couple examples of methods

gamma <- length(unique(df$species_id))

gamma
[1] 5
gamma <-  df %>% 
          select(species_id) %>%
          n_distinct()

gamma
[1] 5

This total count of species richness across a region is referred to as \(\gamma\) diversity.

Alpha Diversity

Similarly, \(\alpha\) diversity is the richness of a local area or a single sample.

Calculate the \(\alpha\) diversity of plots 1, 2, 4, and 5 using a dplyr pipe

alpha_1 <- df %>% 
  filter(plot_id == 1) %>%
  filter(count > 0) %>%
  select(species_id) %>%
  n_distinct()

alpha_2 <- df %>% 
  filter(plot_id == 2) %>%
  filter(count > 0) %>%
  select(species_id) %>%
  n_distinct()

alpha_4 <- df %>% 
  filter(plot_id == 4) %>%
  filter(count > 0) %>%
  select(species_id) %>%
  n_distinct()

alpha_5 <- df %>% 
  filter(plot_id == 5) %>%
  filter(count > 0) %>%
  select(species_id) %>%
  n_distinct()

alpha_1
[1] 5
alpha_2
[1] 5
alpha_4
[1] 3
alpha_5
[1] 3
How similar are these plots? How distinct are they? Is this value or high or low for our region?
  • \(\alpha\) diversity can start to answer these questions, but doesn’t provide too clear of an answer.
  • Additionally, note that the species in plots 4 and 5 are extremely different, even though their richness (alpha diversity) is identical

Calculating \(\beta\) diversity can begin to help us explore these nuances.

Beta Diversity

\(\beta\) diversity has two distinct definitions. It is important to understand which definition someone is using when referring to \(\beta\) diversity. These definitions are:

  • The turnover in species composition among sites across a region or

  • The number of species in a region not observed in a given sample

Therefore both \(\alpha\) (site) and \(\gamma\) (regional) diversity are directly related to \(\beta\) diversity.

\(\beta\) diversity serves as a crucial bridge connecting local biodiversity (\(\alpha\) diversity) with the wider regional species pool (\(\gamma\) diversity). Examining beta diversity is central to community ecology, as it explores the factors that cause species assemblages to vary in similarity across various locations and times.

It is a very popular metric whose nuances are still being explored and perfected for specific scenarios:

Source: Anderson et. al 2011(Anderson et al. 2011)

Classical Beta Diversity Metrics

In its simplest and original form, turnover was calculated via multiplicative partitioning:

\[ \beta \ =\ \frac{\gamma }{\overline{\alpha }} \]

Where \(\overline{\alpha }\) is the average \(\alpha\) diversity of the sample. This formula is similar to the original definition coined by Whittaker(Whittaker 1960).

The second traditional way \(\beta\) diversity was calculated was via additive partitioning.

\[ \beta \ =\ \gamma \ -\ \overline{\alpha } \]

Here, \(\beta\) diversity is the average number of species not found in a sample, even though they are found in that region.

We will calculate the \(\beta\) diversity of sites 4 and 5. To help us do so, we will introduce you to the currently most popular community ecology package in R, vegan (Oksanen et al. 2022).

First, we will need to reformat the data into the format that vegan prefers called a “community data matrix,” with only species names as column names, sampling events down the Y-axis, and abundances where they meet.

For example, imagine you have a day of sampling in a single plot in the field. You would tally the amount of each species you see that day in that same row.

Inspect our dataframe df

df
   plot_id  species_id count
1        1 triangle_sp     2
2        1   square_sp     2
3        1    cross_sp     2
4        1   circle_sp     2
5        1        x_sp     2
6        2 triangle_sp     6
7        2   square_sp     1
8        2    cross_sp     1
9        2   circle_sp     1
10       2        x_sp     1
11       3 triangle_sp     4
12       3   square_sp     1
13       3    cross_sp     2
14       3   circle_sp     2
15       3        x_sp     2
16       4 triangle_sp     4
17       4   square_sp     2
18       4    cross_sp     0
19       4   circle_sp     4
20       4        x_sp     0
21       5 triangle_sp     0
22       5   square_sp     5
23       5    cross_sp     3
24       5   circle_sp     0
25       5        x_sp     2
Are there any changes that need to be made in order for this dataframe to be in community data matrix format?
  • Extra columns (plot_id, species_id) need to be removed. Community data matrices can only include species names as column names and abundances

  • The matrix needs to be pivoted. The species names are entries instead of column names.

The table is in long format so we will need to widen the dataframe by species name. In general data collection and data presentation should be done in a long format since it is more human-readable, whereas analysis should be done in the long format since it is more easily computer readable.

To do so, we will use the tidyR(Wickham and Wickham 2017) pivot_wider() function.

comm_data_matrix <- df %>%
  pivot_wider(names_from = species_id, values_from = count, values_fill = 0)

head(comm_data_matrix)
# A tibble: 5 × 6
  plot_id triangle_sp square_sp cross_sp circle_sp  x_sp
    <dbl>       <dbl>     <dbl>    <dbl>     <dbl> <dbl>
1       1           2         2        2         2     2
2       2           6         1        1         1     1
3       3           4         1        2         2     2
4       4           4         2        0         4     0
5       5           0         5        3         0     2

Remember, community data matrices require columns to only represent species, so you must select() for only species columns before passing a dataframe to vegan.

comm_data_matrix <- comm_data_matrix %>%
  select(-plot_id)

comm_data_matrix
# A tibble: 5 × 5
  triangle_sp square_sp cross_sp circle_sp  x_sp
        <dbl>     <dbl>    <dbl>     <dbl> <dbl>
1           2         2        2         2     2
2           6         1        1         1     1
3           4         1        2         2     2
4           4         2        0         4     0
5           0         5        3         0     2

With this information, create separate community data matrices (df_1, df_2,df_4, df_5) for sites 1, 2, 4, and 5 respectively using filter(row_number() == <row number>)

df_1 <- comm_data_matrix %>%
  filter(row_number() == 1)

df_2 <- comm_data_matrix %>%
  filter(row_number() == 2)

df_4 <- comm_data_matrix %>%
  filter(row_number() == 4)

df_5 <- comm_data_matrix %>%
  filter(row_number() == 5)

To calculate \(\overline{\alpha }\), use the vegan function specnumber():

alpha_bar <- mean(specnumber(df))

Noting that specnumber(df) is a much shorter equivalent to what we did above:

alpha <- df %>% 
  filter(plot_id == <plot number>) %>%
  filter(count > 0) %>%
  select(<species column>) %>%
  n_distinct()

Use this to create the objects alpha_bar_12 and alpha_bar_45

alpha_bar_12 <- mean(specnumber(df_1), specnumber(df_2))
alpha_bar_45 <- mean(specnumber(df_4), specnumber(df_5))

Use these objects and the gamma diversity we calculated above to calculate multiplicative and additive beta diversity for each plot:

# Calculate multiplicative beta diversity for each plot
beta_mult_12 <- gamma / alpha_bar_12
beta_mult_45 <- gamma / alpha_bar_45
# Calculate additive beta diversity for each plot
beta_add_12 <- gamma - alpha_bar_12
beta_add_45 <- gamma - alpha_bar_45
         Metric    Value
1         gamma 5.000000
2       alpha_1 5.000000
3       alpha_2 5.000000
4       alpha_4 3.000000
5       alpha_5 3.000000
6  alpha_bar_12 5.000000
7  alpha_bar_45 3.000000
8  beta_mult_12 1.000000
9  beta_mult_45 1.666667
10  beta_add_12 0.000000
11  beta_add_45 2.000000
What do these results imply?
  • These \(\beta\) diversity values are useful because they are more readily comparable to other regions because they are normalized for the regional diversity (\(\gamma\) diversity)

  • The \(\beta\) diversity values for plots 1 and 2 are at the maximum ( \(\overline{\alpha}\) = \(\beta\) ), the richness of species in these plots is representative of the whole region. The multiplicative and additive values of 0 and 1 respectively imply no species turnover between these plots.

  • The multiplicative and additive \(\beta\) diversity values for plots 4 and 5 is higher than for plots 1 and 2, therefore these plots have more species turnover. High values of \(\beta\) diversity imply these are diverse, non-redundant plots that together build the regional diversity.

Jaccard and Sørenson Indices to Calculate Beta Diversity

For more exact estimates of \(\beta\) diversity, we can calculate the Jaccard and Sørenson distances. These distances are both measures of similarity between two sites, with the Jaccard index being a measure of similarity and the Sørenson index being a measure of dissimilarity. They also both only consider presence/absence, so we are still only looking at species richness.

Let’s start with Jaccard. Sites that are most similar are closer to 0 and those that are most different are closer to 1.

# Calculate Jaccard distance
jaccard_dist <- vegdist(comm_data_matrix, method = "jaccard")

jaccard_dist
          1         2         3         4
2 0.5714286                              
3 0.2500000 0.3846154                    
4 0.5714286 0.5714286 0.5000000          
5 0.5714286 0.8235294 0.6875000 0.8888889

Which plots are are most similar and which are different? Does this match your intuition?

Now let’s do Sørenson which ranges from 0 (identical) to 1 (completely dissimilar).

# Calculate Sørensen distance
# Sørensen is equivalent to "bray" method with binary data in vegdist
sorensen_dist <- vegdist(comm_data_matrix, method = "bray", binary = TRUE)

sorensen_dist
          1         2         3         4
2 0.0000000                              
3 0.0000000 0.0000000                    
4 0.2500000 0.2500000 0.2500000          
5 0.2500000 0.2500000 0.2500000 0.6666667

Which plots are are most similar and which are different? Does this match your intuition?

Other Methods to Calculate Beta Diversity

You must be intentional about which version of \(\beta\) diversity is applicable to your question and system. Many options are available and continue to be developed/popularized for calculating \(\beta\) diversity for different scenarios. Based on our dataset and this flowchart, which method(s) might be the best for calculating \(\beta\) diversity in our case?

Source: Anderson et. al 2011(Anderson et al. 2011)
  • We have not yet considered abundances. However, since we have abundance data we can integrate it into our calculation of \(\beta\) diversity to more accurately differentiate between communities.
  • It is likely that we want to exclude joint absences. With the low number of species in our survey, there are many more species in the world that our system is lacking than it includes. Therefore, what species are in our plot gives us more information than what is not in our plot.
  • Choosing our exact method for calculation will depend on our research question. Are we most interested in rare species or the proportion of species? Are we more interested in relative abundance or log abundance (for example, is there a large difference between abundances)?


We will explore abundance metrics in the following section


Abundance

Remember, richness is the number of species in a community while abundance is the number of individuals from each species in that community.

Up until this point, we have only been using richness in our metrics, however our sample includes abundance values.

Use dplyr to group our long dataframe df by species_id and use mutate() to create a column of total abundances called total_count.

df <- df %>%
  group_by(species_id) %>% 
  mutate(total_count = sum(count)) %>%
  ungroup()
            
df
# A tibble: 25 × 4
   plot_id species_id  count total_count
     <dbl> <chr>       <dbl>       <dbl>
 1       1 triangle_sp     2          16
 2       1 square_sp       2          11
 3       1 cross_sp        2           8
 4       1 circle_sp       2           9
 5       1 x_sp            2           7
 6       2 triangle_sp     6          16
 7       2 square_sp       1          11
 8       2 cross_sp        1           8
 9       2 circle_sp       1           9
10       2 x_sp            1           7
# ℹ 15 more rows

Plot the total abundances of our df as a barplot (geom_col()) using ggplot2(Wickham 2016).

ggplot(data = df, aes(x = species_id, y = total_count)) +
  geom_col() +
  labs(x = "Species", y = "Total Count Across Regions")

What do you notice about the abundances?
  • The species “triangle_sp” is the dominant species, the others are more similar in proportion

Rank Abundance Curves

Rank abundance curves are a common way to visualize the distribution of species abundances in a community. They are a type of species accumulation curve, which are used to estimate the number of species in a community. They give you a quick way to compare how similar/different communities are just based on simple abundances.

# Rank abundances of each species within each plot
df <- df %>%
  group_by(plot_id, species_id) %>%
  summarize(plot_count = sum(count)) %>%
  arrange(plot_id, desc(plot_count)) %>%
  mutate(rank = row_number()) %>%
  ungroup()
`summarise()` has grouped output by 'plot_id'. You can override using the
`.groups` argument.
ggplot(df, aes(x = rank, y = plot_count, color = factor(plot_id))) +
  geom_point() +
  geom_line() +
  labs(x = "Abundance Rank (log scale)", y = "Abundance (log scale)", color = "Plot ID")

What do you notice?
  • Plot 2 is dominated by a single species. In the real world, many rank abundance curves take a similar shape with few dominant species and many rare species

  • Plot 1 has even abundance across species. We will explore evenness in depth later on.

Distances

For simple distances to compare plots, we need to select only 2 species compositions to compare between plots because basic plots are 2-dimensional.

Let’s plot for species “triangle_sp” vs. “circle_sp” for plots 1, 3, and 4 from our comm_data_matrix.

df_tri_cir <- comm_data_matrix %>%
  select(triangle_sp, circle_sp) %>%
  mutate(plot_id = row_number()) %>%
  filter(plot_id == 1 | plot_id == 3 | plot_id == 4)

ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
  geom_text(aes(label = plot_id)) +
  labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")

Can you tell by looking at this plot how different the relationship is between these species by site?
  • Perhaps 1 and 4 are different from each other, but it is hard to tell if 4 or 1 is more similar to plot 3

Euclidean Distance

We can calculate the distance between plots to quantify how different they are. In general, distances in ecology are primarily used to quantify dissimilarity. The first method is calculating the Euclidean distance between points. We just look at this x, y coordinate plot as a map. Therefore, this distance can be calculated with

\[ \sqrt{( x_{1} \ -\ x_{2})^{2} \ +\ ( y_{1} \ -\ y_{2})^{2}} \]

Calculate the Euclidean distance between plots 1 and 3 for our df df_tri_cir

x1 <- df_tri_cir$triangle_sp[1]
x2 <- df_tri_cir$triangle_sp[2]
y1 <- df_tri_cir$circle_sp[1]
y2 <- df_tri_cir$circle_sp[2]
# Calculate Euclidean distance

sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2

Repeat this for plots 3 and 4

x1 <- df_tri_cir$triangle_sp[2]
x2 <- df_tri_cir$triangle_sp[3]
y1 <- df_tri_cir$circle_sp[2]
y2 <- df_tri_cir$circle_sp[3]
sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2
What do these results imply?
  • Plots 1 and 4 seem to be equally similar to plot 3

Bray-Curtis Distance

However, this Euclidean distance can quickly focus too heavily on more abundant species over rare species. To prevent this by weighing abundant and rare species evenly, we can calculate the abundances between two sites divided by the total abundances at each site. This is called the Bray-Curtis distance.

\[ \sum \frac{|x_{ai} \ -\ x_{bi} |}{x_{ai} \ +\ x_{bi}} \]

The vegdist() function from the vegan package provides Bray-Curtis distance by default for an entire dataset. Filter for plots 1 and 3, select for “triangle_sp” and “circle_sp”, and use the vegdist() function with method = "euclidean" to check our previous answer. The slice() function from dplyr is a straightforward way to filter for rows based on row number.

df_tri_cir <- comm_data_matrix %>%
  select(triangle_sp, circle_sp) %>%
  slice(1, 3)
vegdist(df_tri_cir, method="euclidean")
  1
2 2

This is the same answer we got above.

With vegan we can complete these calculations easily for our whole dataset:

vegdist(comm_data_matrix, method="euclidean")
         1        2        3        4
2 4.472136                           
3 2.236068 2.645751                  
4 4.000000 4.000000 3.605551         
5 4.242641 7.615773 6.082763 7.348469

Now repeat using the Bray-Curtis distance

vegdist(comm_data_matrix, method="bray")
          1         2         3         4
2 0.4000000                              
3 0.1428571 0.2380952                    
4 0.4000000 0.4000000 0.3333333          
5 0.4000000 0.7000000 0.5238095 0.8000000
What do you notice in comparison to Euclidean distance results?
  • Even though following Euclidean distance plots 1 and 4 were equally similar to plot 3, following Bray-Curtis distance plots 3 and 4 are more similar than plots 1 and 3

Non-metric Multidimensional Scaling

We have focused on pairs of 2 species, but our dataframe has more than 2 columns. Let’s plot 3 columns (species “triangle_sp”, “circle_sp”, and “square_sp”) for plots 1, 3, and 4. Wrangle data to create df_tri_cir_sq object.

df_tri_cir_sq <- comm_data_matrix %>%
  select(triangle_sp, circle_sp, square_sp) %>%
  slice(1, 3, 4)

Now that we have 3 species, we can plot them in 3 dimensions.

# Convert to plotly 3D scatter plot
plot_ly(data = df_tri_cir_sq, x = ~triangle_sp, y = ~circle_sp, z = ~square_sp, type = 'scatter3d', mode = 'markers') %>%
  layout(scene = list(xaxis = list(title = 'triangle_sp'),
                      yaxis = list(title = 'circle_sp'),
                      zaxis = list(title = 'square_sp'),
                      camera = list(
                        eye = list(x = 0, y = 0, z = 2.5),
                        up = list(x = 0, y = 1, z = 0)
                      )))

At first this plot looks identical to our plot above of “triangle_sp” vs “circle_sp,” however when we rotate we can see the third dimension. Although this looks pretty cool, it is even harder to visually inspect how similar/different these plots are.

Now, how can we visualize 4 or more columns (species “triangle_sp”, “circle_sp”, “square_sp”, and “species_x”) to see their distance for plots 1, 3, and 4? How about 5 columns?

One technique that attempts to visualize an optimal arrangement is Non-metric Multidimensional Scaling (NMDS). Note that the axes of the plotted results are arbitrary (non-metric) but instead represent the relative weight of relationships.

NMDS is an example of an ordination technique, which allow us to arrange multidimensional data and reduce it to low-dimensional space based on their similarities or dissimilarities. This is very useful for visualizing and analyzing complex data.

All of these ordination methods do something more or less similar to the following visualization of a simple PCA, they look at multivariate point cloud from a certain angle and take a 2D slice through that plane:

Once again, vegan can help with this, among other options it can produce NMDS plots by either Euclidean or Bray-Curtis distance.

# Wrangle

df_tri_cir_sq_x <- comm_data_matrix %>%
  select(triangle_sp, circle_sp, square_sp, x_sp) %>%
  slice(1, 3, 4)
nmds_Euc <- metaMDS(df_tri_cir_sq_x, distance = "euclidean")
Warning in metaMDS(df_tri_cir_sq_x, distance = "euclidean"): stress is (nearly)
zero: you may have insufficient data
scores_plot_df <- as.data.frame(scores(nmds_Euc, display = "sites"))

scores_plot_df <- scores_plot_df %>%
  mutate(plot_id = c(1, 3, 4))
# Plot

ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = plot_id)) +
  labs(title = "NMDS Plot with Euclidean Distance", x = "NMDS_1", y = "NMDS_2") 

Repeat for Bray-Curtis distance

# Repeat for Bray-Curtis distance

df_tri_cir_sq_x <- comm_data_matrix %>%
  select(triangle_sp, circle_sp, square_sp, x_sp) %>%
  slice(1, 3, 4)

nmds_Bray <- metaMDS(df_tri_cir_sq_x, distance = "bray")
Run 0 stress 0 
Run 1 stress 0 
... Procrustes: rmse 0.0685145  max resid 0.08513997 
Run 2 stress 0 
... Procrustes: rmse 0.0673008  max resid 0.08091399 
Run 3 stress 0 
... Procrustes: rmse 0.2473814  max resid 0.349626 
Run 4 stress 0 
... Procrustes: rmse 0.06474159  max resid 0.08166007 
Run 5 stress 0 
... Procrustes: rmse 0.07408706  max resid 0.08920374 
Run 6 stress 0 
... Procrustes: rmse 0.09968261  max resid 0.1344664 
Run 7 stress 0 
... Procrustes: rmse 0.06882658  max resid 0.0862992 
Run 8 stress 0 
... Procrustes: rmse 0.1826825  max resid 0.1879906 
Run 9 stress 0 
... Procrustes: rmse 0.1648719  max resid 0.2234559 
Run 10 stress 0 
... Procrustes: rmse 0.1335361  max resid 0.1461431 
Run 11 stress 0 
... Procrustes: rmse 0.1693015  max resid 0.2102284 
Run 12 stress 0 
... Procrustes: rmse 0.2507016  max resid 0.3082039 
Run 13 stress 0 
... Procrustes: rmse 0.1698787  max resid 0.1774413 
Run 14 stress 0 
... Procrustes: rmse 0.09633324  max resid 0.1227606 
Run 15 stress 0 
... Procrustes: rmse 0.08850214  max resid 0.1010874 
Run 16 stress 0 
... Procrustes: rmse 0.08261412  max resid 0.1101298 
Run 17 stress 0 
... Procrustes: rmse 0.1101204  max resid 0.148136 
Run 18 stress 2.085682e-05 
... Procrustes: rmse 0.1080145  max resid 0.1292131 
Run 19 stress 0 
... Procrustes: rmse 0.08904901  max resid 0.1191981 
Run 20 stress 0 
... Procrustes: rmse 0.161462  max resid 0.2214368 
*** Best solution was not repeated -- monoMDS stopping criteria:
    20: stress < smin
Warning in metaMDS(df_tri_cir_sq_x, distance = "bray"): stress is (nearly)
zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
scores_plot_df <- as.data.frame(scores(nmds_Euc, display = "sites"))

scores_plot_df <- scores_plot_df %>%
  mutate(plot_id = c(1, 3, 4))
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = plot_id)) +
  labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2") 

Compare this to our original 2D plot where we used triangle and circle species to try to compare plots, are there any differences that you notice?

df_tri_cir <- comm_data_matrix %>%
  select(triangle_sp, circle_sp) %>%
  mutate(plot_id = row_number()) %>%
  filter(plot_id == 1 | plot_id == 3 | plot_id == 4)

ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
  geom_text(aes(label = plot_id)) +
  labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")

NMDS plots can go well beyond 4-dimensions. Because of this expandability, we can plot distances for all species in our sample simultaneously to identify which plots may be more different from one another as far as species abundances is concerned.

# Repeat creating NMDS using Bray-Curtis distance for all species

nmds_Bray <- metaMDS(comm_data_matrix, distance = "bray")
Run 0 stress 0 
Run 1 stress 0 
... Procrustes: rmse 0.1041114  max resid 0.1325716 
Run 2 stress 0 
... Procrustes: rmse 0.1489232  max resid 0.2743453 
Run 3 stress 8.912132e-05 
... Procrustes: rmse 0.2079909  max resid 0.3658446 
Run 4 stress 0 
... Procrustes: rmse 0.1099848  max resid 0.1856015 
Run 5 stress 0 
... Procrustes: rmse 0.09470483  max resid 0.1642575 
Run 6 stress 5.2717e-05 
... Procrustes: rmse 0.2132488  max resid 0.2750503 
Run 7 stress 0 
... Procrustes: rmse 0.1351432  max resid 0.1790358 
Run 8 stress 0 
... Procrustes: rmse 0.08050803  max resid 0.1121345 
Run 9 stress 0.0003124553 
... Procrustes: rmse 0.2211435  max resid 0.3031416 
Run 10 stress 0 
... Procrustes: rmse 0.08620733  max resid 0.1599822 
Run 11 stress 8.340072e-05 
... Procrustes: rmse 0.223066  max resid 0.291417 
Run 12 stress 0 
... Procrustes: rmse 0.06181389  max resid 0.0904574 
Run 13 stress 3.03667e-05 
... Procrustes: rmse 0.2417417  max resid 0.3402684 
Run 14 stress 0 
... Procrustes: rmse 0.1491092  max resid 0.1700833 
Run 15 stress 0 
... Procrustes: rmse 0.2008313  max resid 0.279812 
Run 16 stress 8.998705e-05 
... Procrustes: rmse 0.2576293  max resid 0.3560075 
Run 17 stress 0 
... Procrustes: rmse 0.1989037  max resid 0.2812514 
Run 18 stress 0 
... Procrustes: rmse 0.1637751  max resid 0.2166281 
Run 19 stress 0 
... Procrustes: rmse 0.1213147  max resid 0.2018581 
Run 20 stress 0 
... Procrustes: rmse 0.05847153  max resid 0.1008559 
*** Best solution was not repeated -- monoMDS stopping criteria:
     1: no. of iterations >= maxit
    19: stress < smin
Warning in metaMDS(comm_data_matrix, distance = "bray"): stress is (nearly)
zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
scores_plot_df <- as.data.frame(scores(nmds_Bray, display = "sites"))

scores_plot_df <- scores_plot_df %>%
  mutate(plot_id = c(1, 2, 3, 4, 5))

ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = plot_id)) +
  labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2") 

What hypotheses can we make based on this plot?
  • We can make various hypotheses, such as that plots 1 and 3 are more similar to one another than to the other plots regrading overall species abundances

Now let’s quantify these results by finding the actual distances between these points on our ordination plot:

scores <- scores(nmds_Bray, display = "sites")

vegdist(scores, method="euclidean")
          1         2         3         4
2 0.3987570                              
3 0.1638629 0.2371359                    
4 0.3998220 0.4004870 0.3281351          
5 0.4003879 0.7003845 0.5205948 0.8000000
What do these results imply?
  • Plots 1 and 3 are the most similar
  • Plots 4 and 5 are the most different from one another
  • When looking at plot 1 where the distance to the other plots is hard to discern visually, it is closest to 2, then 4, then 5, though the distance is very similar between them and may not significantly different between these pairs

Again, NMDS is an example of an ordination technique, other examples include:

  • Principal Component Analysis (PCA) is primarily used for uncovering patterns in data by reducing its dimensions linearly, without considering external variables.
  • Canonical Correspondence Analysis (CCA) to explore the relationships between two sets of variables, particularly when one set is hypothesized to influence the other.
  • Principal Coordinates Analysis (PCoA) focuses on preserving the pairwise distances between observations when data is reduced to a lower-dimensional space.

These ordination techniques are used to take into account many aspects, including environmental variables, to make comparisons using multiple dimensions.


Diversity Indices

Let’s bring our understanding together to quantify diversity using indices. Diversity is challenging to quantify because it is difficult to agree on a definition. For example, we could attempt to quantify biodiversity with any of the following:

  • Probability of encountering a species at random
  • Uncertainty of possible community states
  • Variance in species composition between sites

Species Richness Index

The most common biodiversity index is the Species Richness Index, counting the number of unique species in a community. We already completed this task above, calling it \(\gamma\) diversity for our system:

species_richness <-  df %>% 
          select(species_id) %>%
          n_distinct()

species_richness
[1] 5

However, it is highly sensitive to variations in sampling effort because it treats all species equally, weighing rare and common species the same despite the greater likelihood of rare species going undetected.

It does not account for evenness: the fact that some species in the community are common and others are rare.

Source: davidzeleny.net
What do you notice about the examples above?
  • They have equal species richness, however the communities are very different based on evenness.

Shannon and Simpson Diversity Indices

The Shannon(Shannon 1948) and Simpson Diversity Indices(Simpson 1949), both special cases of the Hill’s Number(Hill 1973), are very popular biodiversity indices because they focus on both richness and evenness.

\[ \begin{array}{l} Shannon-Wiener\ Diversity\ =-\sum _{i\ =\ i}^{S} p_{i}\ln p_{i}\\ \\ Simpson\ Diversity\ =\frac{1}{\sum _{i\ =\ 1}^{S} \ p_{i}^{2}}\\ \\ p_{i} \ =\ relative\ abundance\ of\ species\ i\\ S\ =\ total\ number\ of\ species \end{array} \]

What’s in a name?

You will likely see the Shannon Index referred to in different ways throughout the literature because of articles/books that were published on the topic by three different mathematicians over the course of a couple of years. Here is a quick guide to the alternative names you will encounter:

As richness and evenness increase, these indices increase. It is important to remember that Simpson’s Diversity Index emphasizes common species, while the Shannon Diversity Index is biased towards rare species. Here is how you can interpret these results:

Shannon’s Index interpretation:

  • Based in information theory and represents the uncertainty that we will be able to predict which species will be randomly drawn from a community.
  • Values of \(H\) (Shannon Entropy) are usually between 1.5-3.5 (the units are bits of information)
  • The maximum value of \(H\) represents a perfectly even community

Simpson’s and Gini-Simpson’s Indeces interpretation:

  • Represents the probability that two randomly selected individuals are from the same species.
  • Simpson’s Index decreases with increasing richness, because of this Gini-Simpson index (\(1 - D\)) is often used to make value more comparable with other indices.
  • The values of \(D\) range from 0 to 1

Because Simpson’s/Gini-Simpson’s indices refer more directly to richness than evenness, we can also calculate Simpson’s Evenness (“equitability”) where \(S\) is the number of species:

\[ \frac{1}{D S} \]

Now let’s calculate these values for our dataset. Thanks once again to the vegan package, these equations are readily available through the function diversity(). Here is the ?vegan::diversity() help page:

Here is another nice summary as well explaining what exactly the diversity() function is actually calculating:

Source: davidzeleny.net

Wrangle our dataframe comm_data_matrix as needed, then compare the richness, Gini-Simpson Index, Simpson’s Evenness, and Shannon Index values for plots 1 and 2.

# Wrangle data
cdm_1 <- comm_data_matrix %>%
  slice(1)

cdm_2 <- comm_data_matrix %>%
  slice(2)

cdm_1
# A tibble: 1 × 5
  triangle_sp square_sp cross_sp circle_sp  x_sp
        <dbl>     <dbl>    <dbl>     <dbl> <dbl>
1           2         2        2         2     2
cdm_2
# A tibble: 1 × 5
  triangle_sp square_sp cross_sp circle_sp  x_sp
        <dbl>     <dbl>    <dbl>     <dbl> <dbl>
1           6         1        1         1     1
# Calculate richness

species_richness_1 <- specnumber(cdm_1)

species_richness_2 <- specnumber(cdm_2)

species_richness_1
[1] 5
species_richness_2
[1] 5
# Calculate Gini-Simpson's diversity

diversity(cdm_1, index = "simpson")
[1] 0.8
diversity(cdm_2, index = "simpson")
[1] 0.6
# Calculate Simpson's Evenness
Gs_1 <- diversity(cdm_1, index = "simpson")
Gs_2 <- diversity(cdm_2, index = "simpson")

D_1 <- 1 - Gs_1
D_2 <- 1 - Gs_2

1 / (D_1 * ncol(cdm_1))
[1] 1
1 / (D_2 * ncol(cdm_2))
[1] 0.5
# Calculate Shannon's diversity:

diversity(cdm_1, index = "shannon")
[1] 1.609438
diversity(cdm_2, index = "shannon")
[1] 1.227529
How would you interpret these results?
  • Plots 1 and 2 have the same richness and very similar Gini-Simpson Index values which is related to measuring richness. However, for Shannon Entropy and even more so for Simpson’s Evenness the values the values are very distinct. Plots 1 is very even and plot 2 is very un-even. Without checking for evenness we might have missed how distinct these two communities are with plot 1 being very even and plot 2 being dominated by a single species (triangle sp.)

Now we can go through our analyses throughout this lab to answer the following questions:

Which site is most rich?
  • They are all equally rich (5 species)
Which site has the least turnover?
  • Plot 1
Which site is most even?
  • Plot 1
Which sites are most different from one another following richness-based methods?
  • Plots 4 and 5
Which sites are most similar to one another following richness-based methods?
  • Plots 1 and 3
Is there a dominant species? If so, which is it?
  • Triangle species (especially in plot 2)
Which sites are most different from one another following abundance-based methods?
  • Plots 4 and 5
Which sites are most similar to one another following abundance-based methods?
  • Plots 1 and 3

References

Anderson, Marti J, Thomas O Crist, Jonathan M Chase, Mark Vellend, Brian D Inouye, Amy L Freestone, Nathan J Sanders, et al. 2011. “Navigating the Multiple Meanings of \(\beta\) Diversity: A Roadmap for the Practicing Ecologist.” Ecology Letters 14 (1): 19–28.
Hill, Mark O. 1973. “Diversity and Evenness: A Unifying Notation and Its Consequences.” Ecology 54 (2): 427–32.
Mittelbach, Gary G, and Brian J McGill. 2019. Community Ecology. Oxford University Press.
Oksanen, Jari, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O’Hara, et al. 2022. “Vegan: Community Ecology Package.” https://CRAN.R-project.org/package=vegan.
Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (3): 379–423.
Simpson, Edward H. 1949. “Measurement of Diversity.” Nature 163 (4148): 688–88.
Stevens, Martin Henry Hoffman. 2009. A Primer of Ecology with r. Springer.
Whittaker, Robert Harding. 1960. “Vegetation of the Siskiyou Mountains, Oregon and California.” Ecological Monographs 30 (3): 279–338.
Wickham, Hadley. 2016. “Ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.
Wickham, Hadley, and Maintainer Hadley Wickham. 2017. “Package ‘Tidyr’.” Easily Tidy Data with’spread’and’gather ()’Functions.