Lab 7

Authors

Charles Lehnen

Melissa Guzman

Cameron Egan

Biodiversity

There are many ways to measure biodiversity and methods to compare the biodiversity between populations or communities. The simplest two are often considered to be abundance and richness.

Richness

Do you remember the difference between richness and abundance?

  • Richness is the number of species in a community while abundance is the number of individuals from each species in that community.

Do you remember the three ways we have talked about measuring diversity this semester?

  • Alpha, beta, and gamma diversity

Gamma Diversity

It’s time to collect data along a transect! Create a dataframe for this dataset across the five vegetation zones which follow an elevation gradient:

# Directly creating the long format dataframe
df <- data.frame(
  plot_id = rep(c(1, 2, 3, 4, 5), each = 5),
  species_id = rep(c("triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp"), times = 5),
  count = c(2, 2, 2, 2, 2,
            6, 1, 1, 1, 1,
            4, 1, 2, 2, 2,
            4, 2, 0, 4, 0,
            0, 5, 3, 0, 2)
)
head(df)
  plot_id  species_id count
1       1 triangle_sp     2
2       1   square_sp     2
3       1    cross_sp     2
4       1   circle_sp     2
5       1        x_sp     2
6       2 triangle_sp     6

What is the species richness of our region?

# There are various ways to accomplish this. Here are a couple examples of methods

gamma <- length(unique(df$species_id))

gamma
[1] 5
gamma <-  df %>% 
          select(species_id) %>%
          n_distinct()

gamma
[1] 5
  • 5

This total count of species richness across a region is referred to as \(\gamma\) diversity.

Alpha Diversity

Similarly, \(\alpha\) diversity is the richness of a local area or a single sample.

Calculate the \(\alpha\) diversity of plots 4 and 5

alpha_4 <- df %>% 
  filter(plot_id == 4) %>%
  select(species_id) %>%
  n_distinct()

alpha_5 <- df %>% 
  filter(plot_id == 5) %>%
  select(species_id) %>%
  n_distinct()

alpha_4
[1] 5
alpha_5
[1] 5

How similar are these plots? How distinct are they? Is this value or high or low for our region?

  • \(\alpha\) diversity can start to answer these questions, but doesn’t provide too clear of an answer.
  • Additionally, note that the species in plots 4 and 5 are extremely different, even though their richness is identical

Calculating \(\beta\) diversity can begin to help us explore these nuances.

Beta Diversity

\(\beta\) diversity can be defined as the number of species in a region not observed in sample or the turnover in species composition among sites. Therefore both \(\gamma\) (regional) and \(\alpha\) (site) diversity are directly related to \(\beta\) diversity.

\(\beta\) diversity is often calculated via additive partitioning.

\[ \beta \ =\ \gamma \ -\ \overline{\alpha } \]

Where \(\overline{\alpha }\) is the average \(\alpha\) diversity of the sample. Here, \(\beta\) diversity is the average number of species not found in a sample even though they are found in a region.

It is also commonly calculated using multiplicative partitioning:

\[ \beta \ =\ \frac{\gamma }{\overline{\alpha }} \]

This metric for turnover aids in determining how many unique assemblages exist in a region. It is similar to the original definition coined by Whittaker.

As with any step of model selection, you must be intentional about which version of \(\beta\) diversity is applicable to your question and sample.

We will calculate the beta diversity of sites 4 and 5. Once again, we will be using the vegan package to help with this.

First, we will need to reformat the data into the format that vegan prefers called a “Community data matrix,” with species names across the X-axis of the dataframe, sampling event down the Y-axis, and abundances where they meet. We will widen the dataframe by species name. In general data collection and data presentation should be done in a long format since it is more human-readable, whereas analysis should be done in the long format since it is more easily computer readable. To do so, we will use the tidyR pivot_wider() function.

comm_data_matrix <- df %>%
  pivot_wider(names_from = species_id, values_from = count, values_fill = 0)

head(comm_data_matrix)
# A tibble: 5 × 6
  plot_id triangle_sp square_sp cross_sp circle_sp  x_sp
    <dbl>       <dbl>     <dbl>    <dbl>     <dbl> <dbl>
1       1           2         2        2         2     2
2       2           6         1        1         1     1
3       3           4         1        2         2     2
4       4           4         2        0         4     0
5       5           0         5        3         0     2

To calculate \(\overline{\alpha }\), use the vegan function specnumber():

alpha_bar <- mean(specnumber(df))

This will need to be done after the dataframe has only species columns, because that is the format vegan prefers.

With this information, calculate beta diversity for sites 1, 4, and 5.

# Wrangle data
df_1 <- comm_data_matrix %>%
  filter(plot_id == 1) %>%
  select(-plot_id) # Vegan only likes species columns

df_4 <- comm_data_matrix %>%
  filter(plot_id == 4) %>%
  select(-plot_id)

df_5 <- comm_data_matrix %>%
  filter(plot_id == 5) %>%
  select(-plot_id)


# Calculate alpha bar for each plot

alpha_bar_1 <- mean(specnumber(df_1))
alpha_bar_4 <- mean(specnumber(df_4))
alpha_bar_5 <- mean(specnumber(df_5))


# Calculate additive beta diversity for each plot
beta_add_1 <- gamma - alpha_bar_1
beta_add_4 <- gamma - alpha_bar_4
beta_add_5 <- gamma - alpha_bar_5


# Calculate multiplicative beta diversity for each plot
beta_mult_1 <- gamma / alpha_bar_1
beta_mult_4 <- gamma / alpha_bar_4
beta_mult_5 <- gamma / alpha_bar_5


# Output the results
paste("gamma:", gamma)
[1] "gamma: 5"
print("alpha_bar")
[1] "alpha_bar"
paste("alpha_bar_1:", alpha_bar_1)
[1] "alpha_bar_1: 5"
paste("alpha_bar_4:", alpha_bar_4)
[1] "alpha_bar_4: 3"
paste("alpha_bar_5:", alpha_bar_5)
[1] "alpha_bar_5: 3"
print("beta_add")
[1] "beta_add"
paste("beta_add_1:", beta_add_1) 
[1] "beta_add_1: 0"
paste("beta_add_4:", beta_add_4)
[1] "beta_add_4: 2"
paste("beta_add_5:", beta_add_5)
[1] "beta_add_5: 2"
print("beta_mult_1")
[1] "beta_mult_1"
paste("beta_mult_1:", beta_mult_1)
[1] "beta_mult_1: 1"
paste("beta_mult_4:", beta_mult_4)
[1] "beta_mult_4: 1.66666666666667"
paste("beta_mult_5:", beta_mult_5)
[1] "beta_mult_5: 1.66666666666667"

What do these results imply?

  • These \(\beta\) diversity values for plot 1 are at the maximum, the richness of speices in this plot is represented of the whole region.

  • The \(\beta\) diversity values for plots 4 and 5 are equal, even though the actual species in these plots are very different

  • These \(\beta\) diversity values are useful because they are more readily comparable to other regions because they are normalized for the regional diversity (\(\gamma\) diversity)

  • Because \(\beta \ <\ \overline{\alpha }\), each plot contains more species from the region than it is lacking.


Abundance

Richness is the number of species in a community while abundance is the number of individuals from each species in that community.

How can we find abundance within the community we are studying?

  • We need to count the number of individuals detected from each species.

Use dplyr to group our wide dataframe df by species_id and use summarize() to create a column of total abundances called count

abundances <- df %>%
  group_by(species_id) %>% 
  summarize(count = sum(count)) %>%
  ungroup()
            
abundances
# A tibble: 5 × 2
  species_id  count
  <chr>       <dbl>
1 circle_sp       9
2 cross_sp        8
3 square_sp      11
4 triangle_sp    16
5 x_sp            7

Plot the abundances of our new dataframe abundances as a barplot using ggplot2.

ggplot(data = abundances, aes(x = reorder(species_id, -count), y = count)) +
  geom_col() +
  labs(x = "Species", y = "Total Count Across Regions")

What do you notice about the abundances?

  • The species “triangle_sp” is the dominant species, the others are more similar in proportion

Rank Abundance Curves

Rank abundance curves are a common way to visualize the distribution of species abundances in a community. They are a type of species accumulation curve, which are used to estimate the number of species in a community. They give you a quick way to compare how similar/different communities are just based on simple abundances.

df_ranked <- df %>%
  group_by(plot_id, species_id) %>%
  summarize(total_count = sum(count)) %>%
  arrange(plot_id, desc(total_count)) %>%
  mutate(rank = row_number()) %>%
  ungroup()
`summarise()` has grouped output by 'plot_id'. You can override using the
`.groups` argument.
ggplot(df_ranked, aes(x = rank, y = total_count, color = factor(plot_id))) +
  geom_point() +
  geom_line() +
  scale_x_continuous(trans = "log10", breaks = c(1, 10, 100), minor_breaks = NULL) +
  scale_y_continuous(trans = "log10", breaks = c(1, 10, 100), minor_breaks = NULL) +
  labs(x = "Abundance Rank (log scale)", y = "Abundance (log scale)", color = "Plot ID")

What do you notice?

  • Plot 2 is dominated by a single species. In the real world, many rank abundance curves take a similar shape with few dominant species and many rare species

  • Plot 1 has even abundance across species. We will explore evenness in depth later on.

Distances

Let’s plot for species “triangle_sp” vs. “circle_sp” for plots 1, 3, and 4 from our comm_data_matrix.

df_tri_cir <- comm_data_matrix %>%
  dplyr::select(plot_id, triangle_sp, circle_sp) %>%
  filter(plot_id == 1 | plot_id == 3 | plot_id == 4)

ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
  geom_text(aes(label = plot_id)) +
  labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")

Can you tell by looking at this plot how different the relationship is between these species by site?

  • Perhaps 1 and 4 are different from each other, but it is hard to tell if 4 or 1 is more similar to plot 3

Euclidean Distance

We can calculate the distance between plots to quantify how different they are. In general, distances in ecology are primarily used to quantify dissimilarity. The first method is calculating the Euclidean distance between points. We just look at this x, y coordinate plot as a map. Therefore, this distance can be calculated with

\[ \sqrt{( x_{1} \ -\ x_{2})^{2} \ +\ ( y_{1} \ -\ y_{2})^{2}} \]

Calculate the Euclidean distance between plots 1 and 3 for our df df_tri_cir

x1 <- df_tri_cir$triangle_sp[1]
x2 <- df_tri_cir$triangle_sp[2]
y1 <- df_tri_cir$circle_sp[1]
y2 <- df_tri_cir$circle_sp[2]

sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2

Repeat this for plots 3 and 4

x1 <- df_tri_cir$triangle_sp[2]
x2 <- df_tri_cir$triangle_sp[3]
y1 <- df_tri_cir$circle_sp[2]
y2 <- df_tri_cir$circle_sp[3]

sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2

What do these results imply?

  • Plots 1 and 4 seem to be equally similar to plot 3

Bray-Curtis/Sørenson Distance

However, this Euclidean distance can quickly focus heavily on more abundant species over rare species. To prevent this by weighing abundant and rare species evenly, we can calculate the abundances between two sites divided by the total abundances at each site. This is called the Bray-Curtis or Sørenson Distance.

\[ \sum \frac{|x_{ai} \ -\ x_{bi} |}{x_{ai} \ +\ x_{bi}} \]

To help us do so, we will introduce you to the currently most popular community ecology package in R, vegan (Oksanen et al. 2022).

The package vegan frequently takes community data matrices as the input and require columns to only represent species, so you must select() for only species columns before passing a dataframe to vegan.

The vegdist() function from the vegan package provides Bray-Curtis distance by default for an entire dataset. Filter for plots 1 and 3, select for “triangle_sp” and “circle_sp”, and use the vegdist() function with method = "euclidean" to check our previous answer:

df_tri_cir_dcm <- comm_data_matrix %>%
  dplyr::select(plot_id, triangle_sp, circle_sp) %>%
  filter(plot_id == 1 | plot_id == 3) %>%
  dplyr::select(triangle_sp, circle_sp)

vegdist(df_tri_cir_dcm, method="euclidean")
  1
2 2

Now repeat using the Bray-Curtis distance

vegdist(df_tri_cir_dcm, method="bray")
    1
2 0.2

Repeat for plots 3 versus 4, what do you notice in comparison to Euclidean distance results?

df_tri_cir_dcm <- comm_data_matrix %>%
  dplyr::select(plot_id, triangle_sp, circle_sp) %>%
  filter(plot_id == 3 | plot_id == 4) %>%
  dplyr::select(triangle_sp, circle_sp)

vegdist(df_tri_cir_dcm, method="bray")
          1
2 0.1428571
  • Even though following Euclidean distance plots 1 and 4 were equally similar to plot 3, following Bray-Curtis distance plots 3 and 4 are more similar than plots 1 and 3

Non-metric Multidimensional Scaling

We have focused on pairs of 2 species, but our dataframe has more than 2 columns. Let’s plot 3 columns (species “triangle_sp”, “circle_sp”, and “square_sp”) for plots 1, 3, and 4.

df_tri_cir_sq <- comm_data_matrix %>%
  dplyr::select(plot_id, triangle_sp, circle_sp, square_sp) %>%
  filter(plot_id == 1 | plot_id == 3 | plot_id == 4)

# Convert to plotly 3D scatter plot
plot_ly(data = df_tri_cir_sq, x = ~triangle_sp, y = ~circle_sp, z = ~square_sp, type = 'scatter3d', mode = 'markers') %>%
  layout(scene = list(xaxis = list(title = 'triangle_sp'),
                      yaxis = list(title = 'circle_sp'),
                      zaxis = list(title = 'square_sp'),
                      camera = list(
                        eye = list(x = 0, y = 0, z = 2.5),
                        up = list(x = 0, y = 1, z = 0)
                      )))

At first this plot looks identical to our plot above of “triangle_sp” vs “circle_sp,” however when we rotate we can see the third dimension. Although this is pretty cool, it is even harder to visually inspect how similar/different these plots are.

Now, how can we visualize 4 or more columns (species “triangle_sp”, “circle_sp”, “square_sp”, and “species_x”) to see their distance for plots 1, 3, and 4? How about 5 columns?

One technique that attempts to visualize an optimal arrangement is Non-metric Multidimensional Scaling (NMDS). Note that the axes of the plotted results are arbitrary (non-metric) but instead represent the relative weight of relationships.

NMDS is an example of an ordination technique, which allow us to arrange multidimensional data and reduce it to low-dimensional space based on their similarities or dissimilarities. This is very useful for visualizing and analyzing complex data.

All of these ordination methods do something more or less similar to the following visualization of a simple PCA, they look at multivariate point cloud from a certain angle and take a 2D slice through that plane:

Once again, vegan can help with this, among other options it can produce NMDS plots by either Euclidean or Bray-Curtis distance.

df_tri_cir_sq_x <- comm_data_matrix %>%
  dplyr::select(plot_id, triangle_sp, circle_sp, square_sp, x_sp) %>%
  filter(plot_id == 1 | plot_id == 3 | plot_id == 4)

nmds_Euc <- metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance = "euclidean")
Run 0 stress 0 
Run 1 stress 0 
... Procrustes: rmse 0.07171074  max resid 0.08708606 
Run 2 stress 0 
... Procrustes: rmse 0.03748834  max resid 0.04495218 
Run 3 stress 0 
... Procrustes: rmse 0.1470433  max resid 0.1858873 
Run 4 stress 0 
... Procrustes: rmse 0.2956769  max resid 0.4175908 
Run 5 stress 0 
... Procrustes: rmse 0.0490357  max resid 0.05851162 
Run 6 stress 0 
... Procrustes: rmse 0.1516245  max resid 0.2033164 
Run 7 stress 0 
... Procrustes: rmse 0.09362361  max resid 0.1028682 
Run 8 stress 0 
... Procrustes: rmse 0.02580521  max resid 0.02974746 
Run 9 stress 0 
... Procrustes: rmse 0.09593518  max resid 0.1029452 
Run 10 stress 0 
... Procrustes: rmse 0.2419987  max resid 0.3381087 
Run 11 stress 0 
... Procrustes: rmse 0.2578759  max resid 0.3545848 
Run 12 stress 0 
... Procrustes: rmse 0.2412986  max resid 0.2955211 
Run 13 stress 0 
... Procrustes: rmse 0.1268612  max resid 0.1581423 
Run 14 stress 0 
... Procrustes: rmse 0.05068433  max resid 0.05982502 
Run 15 stress 0 
... Procrustes: rmse 0.08900928  max resid 0.09679533 
Run 16 stress 0 
... Procrustes: rmse 0.08125188  max resid 0.08834711 
Run 17 stress 0 
... Procrustes: rmse 0.0829068  max resid 0.0880893 
Run 18 stress 0 
... Procrustes: rmse 0.03940576  max resid 0.04417649 
Run 19 stress 0 
... Procrustes: rmse 0.03521958  max resid 0.03982364 
Run 20 stress 0 
... Procrustes: rmse 0.2683186  max resid 0.3755661 
*** Best solution was not repeated -- monoMDS stopping criteria:
    20: stress < smin
Warning in metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance =
"euclidean"): stress is (nearly) zero: you may have insufficient data
scores_plot_df <- as.data.frame(scores(nmds_Euc, display = "sites"))
scores_plot_df$plot_id <- df_tri_cir_sq_x$plot_id

ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = plot_id)) +
  labs(title = "NMDS Plot with Euclidean Distance", x = "NMDS_1", y = "NMDS_2") 

# Repeat for Bray-Curtis distance

nmds_Bray <- metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance = "bray")
Run 0 stress 0 
Run 1 stress 0 
... Procrustes: rmse 0.1026729  max resid 0.138582 
Run 2 stress 0 
... Procrustes: rmse 0.03238101  max resid 0.04118483 
Run 3 stress 0 
... Procrustes: rmse 0.2146753  max resid 0.3011667 
Run 4 stress 0 
... Procrustes: rmse 0.02246243  max resid 0.02763603 
Run 5 stress 0 
... Procrustes: rmse 0.1077626  max resid 0.1198168 
Run 6 stress 0 
... Procrustes: rmse 0.1215243  max resid 0.1657981 
Run 7 stress 0 
... Procrustes: rmse 0.100309  max resid 0.1130419 
Run 8 stress 0 
... Procrustes: rmse 0.1057652  max resid 0.1237793 
Run 9 stress 0 
... Procrustes: rmse 0.1455311  max resid 0.16061 
Run 10 stress 0 
... Procrustes: rmse 0.1924084  max resid 0.2401998 
Run 11 stress 0 
... Procrustes: rmse 0.07820513  max resid 0.09101498 
Run 12 stress 0 
... Procrustes: rmse 0.09694032  max resid 0.1122681 
Run 13 stress 0 
... Procrustes: rmse 0.1898867  max resid 0.2449223 
Run 14 stress 0 
... Procrustes: rmse 0.1942403  max resid 0.2607276 
Run 15 stress 0 
... Procrustes: rmse 0.1619204  max resid 0.197488 
Run 16 stress 0 
... Procrustes: rmse 0.08279267  max resid 0.09962706 
Run 17 stress 0 
... Procrustes: rmse 0.1924818  max resid 0.2684008 
Run 18 stress 0 
... Procrustes: rmse 0.1587719  max resid 0.2199899 
Run 19 stress 0 
... Procrustes: rmse 0.1780853  max resid 0.1856434 
Run 20 stress 0 
... Procrustes: rmse 0.2100083  max resid 0.2605126 
*** Best solution was not repeated -- monoMDS stopping criteria:
    20: stress < smin
Warning in metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance = "bray"):
stress is (nearly) zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
scores_plot_df <- as.data.frame(scores(nmds_Bray, display = "sites"))
scores_plot_df$plot_id <- df_tri_cir_sq_x$plot_id

ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = plot_id)) +
  labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2") 

Compare this to our original plot, are there any differences that you notice?

df_tri_cir <- comm_data_matrix %>%
  dplyr::select(plot_id, triangle_sp, circle_sp) %>%
  filter(plot_id == 1 | plot_id == 3 | plot_id == 4)

ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
  geom_text(aes(label = plot_id)) +
  labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")

NMDS plots can go well beyond 4-dimensions, because of this expandability, we can plot distances for all species in our sample simultaneously to identify which plots may be more different from one another as far as species abundances is concerned.

# Repeat creating NMDS using Bray-Curtis distance for all species

nmds_Bray <- metaMDS(comm_data_matrix %>% select(-plot_id), distance = "bray")
Run 0 stress 0 
Run 1 stress 0 
... Procrustes: rmse 0.2030446  max resid 0.2747318 
Run 2 stress 0 
... Procrustes: rmse 0.07315567  max resid 0.1075529 
Run 3 stress 5.826605e-05 
... Procrustes: rmse 0.1376449  max resid 0.1645601 
Run 4 stress 0 
... Procrustes: rmse 0.1687072  max resid 0.2028889 
Run 5 stress 0 
... Procrustes: rmse 0.1949457  max resid 0.2825307 
Run 6 stress 6.16511e-05 
... Procrustes: rmse 0.1878843  max resid 0.3295183 
Run 7 stress 0 
... Procrustes: rmse 0.1573602  max resid 0.2940581 
Run 8 stress 0 
... Procrustes: rmse 0.08422943  max resid 0.1286996 
Run 9 stress 0 
... Procrustes: rmse 0.1081722  max resid 0.1791993 
Run 10 stress 9.905905e-05 
... Procrustes: rmse 0.2062697  max resid 0.3621891 
Run 11 stress 0 
... Procrustes: rmse 0.07446945  max resid 0.09334791 
Run 12 stress 0 
... Procrustes: rmse 0.07404918  max resid 0.09043398 
Run 13 stress 7.930351e-05 
... Procrustes: rmse 0.1300944  max resid 0.2363241 
Run 14 stress 5.163153e-05 
... Procrustes: rmse 0.1899391  max resid 0.2320315 
Run 15 stress 0 
... Procrustes: rmse 0.1326119  max resid 0.2120979 
Run 16 stress 0 
... Procrustes: rmse 0.08030282  max resid 0.1132989 
Run 17 stress 0 
... Procrustes: rmse 0.05150546  max resid 0.0877158 
Run 18 stress 9.904681e-05 
... Procrustes: rmse 0.2449894  max resid 0.326838 
Run 19 stress 0 
... Procrustes: rmse 0.09960809  max resid 0.1500808 
Run 20 stress 0 
... Procrustes: rmse 0.1434484  max resid 0.2017473 
*** Best solution was not repeated -- monoMDS stopping criteria:
    20: stress < smin
Warning in metaMDS(comm_data_matrix %>% select(-plot_id), distance = "bray"):
stress is (nearly) zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
scores_plot_df <- as.data.frame(scores(nmds_Bray, display = "sites"))
scores_plot_df$plot_id <- comm_data_matrix$plot_id

ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = plot_id)) +
  labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2") 

What hypotheses can we make based on this plot?

  • We can make various hypotheses, such as that plots 1 and 3 are more similar to one another than to the other plots regrading overall species abundances

Again, NMDS is an example of an ordination technique, other examples include:

  • Principal Component Analysis (PCA) is primarily used for uncovering patterns in data by reducing its dimensions linearly, without considering external variables.
  • Canonical Correspondence Analysis (CCA) to explore the relationships between two sets of variables, particularly when one set is hypothesized to influence the other.
  • Principal Coordinates Analysis (PCoA) focuses on preserving the pairwise distances between observations when data is reduced to a lower-dimensional space.

Diversity Indices

Let’s bring our understanding together to quantify diversity using indices. Diversity is challenging to quantify because it is difficult to agree on a definition. For example, we could attempt to quantify biodiversity with any of the following:

  • Probability of encountering a species at random
  • Uncertainty of possible community states
  • Variance in species composition between sites

Species Richness Index

The most common biodiversity index is the Species Richness Index, counting the number of unique species in a community. We already completed this task above, calling it \(\gamma\) diversity for our system:

species_richness <-  df %>% 
          select(species_id) %>%
          n_distinct()

species_richness
[1] 5
species_richness <- ncol(comm_data_matrix) - 1

species_richness
[1] 5

However, it is highly sensitive to variations in sampling effort because it treats all species equally, weighing rare and common species equally despite the greater likelihood of rare species going undetected.

It does not account for evenness: the fact that some species in the community are common and others are rare.

Source: davidzeleny.net

What do you notice about the examples above?

  • They have equal species richness, however the communities are very different based on evenness.

Shannon and Simpson Diversity Indices

The Shannon(Shannon 1948) and Simpson Diversity Indices(Simpson 1949), both special cases of the Hill’s Number(Hill 1973), are very popular biodiversity indices because they focus on both richness and evenness.

\[ \begin{array}{l} Shannon-Wiener\ Diversity\ =-\sum _{i\ =\ i}^{S} p_{i}\ln p_{i}\\ \\ Simpson\ Diversity\ =\frac{1}{\sum _{i\ =\ 1}^{S} \ p_{i}^{2}}\\ \\ p_{i} \ =\ relative\ abundance\ of\ species\ i\\ S\ =\ total\ number\ of\ species \end{array} \]

What’s in a name?

You will likely see the Shannon Index referred to in different ways throughout the literature because of articles/books that were published on the topic by three different mathematicians over the course of a couple of years. Here is a quick guide to the alternative names you will encounter:

As richness and evenness increase, these indices increase. It is important to remember that Simpson’s Diversity Index emphasizes common species, while the Shannon Diversity Index is biased towards rare species. Here is how you can interpret these results:

Shannon’s Index interpretation:

  • Shannon’s Index is based in information theory and represents the uncertainty that we will be able to predict which species will be randomly drawn from a community.
  • Values of \(H\) (Shannon Entropy) are usually between 1.5-3.5 (the units are bits of information)
  • The maximum value of \(H\) represents a perfectly even community

Simpson’s Index interpretation:

  • Simpson’s Index represents the probability that two randomly selected individuals are from the same species.
  • Simpson’s Index decreases with increasing richness, because of this Gini-Simpson index (\(1 - D\)) is often used to make value more comparable with other indices.
  • The values of \(D\) range from 0 to 1

Because Simpson’s/Gini-Simpson’s indices refer more directly to richness than evenness, we can also calculate Simpson’s Evenness (“equitability”) where \(S\) is the number of species:

\[ \frac{1}{D S} \]

Now let’s calculate these values for our dataset. Thanks once again to the vegan package, these equations are readily available through the function diversity(). Here is the ?vegan::diversity() help page:

Here is another nice summary as well explaining what exactly the diversity() function is actually calculating:

Source: davidzeleny.net

Wrangle our dataframe comm_data_matrix as needed, then compare the richness, Gini-Simpson Index, Simpson’s Evenness, and Shannon Index values for plots 1 and 2.

# Wrangle data
cdm_1 <- comm_data_matrix %>%
  filter(plot_id == 1) %>%
  select(-plot_id)

cdm_2 <- comm_data_matrix %>%
  filter(plot_id == 2) %>%
  select(-plot_id)


# Calculate richness

print("Richness")
[1] "Richness"
sum(cdm_1 > 0)
[1] 5
sum(cdm_2 > 0)
[1] 5
# Calculate Gini-Simpson's diversity

print("Gini-Simpson")
[1] "Gini-Simpson"
diversity(cdm_1, index = "simpson")
[1] 0.8
diversity(cdm_2, index = "simpson")
[1] 0.6
# Calculate Simpson's Evenness
Gs_1 <- diversity(cdm_1, index = "simpson")
Gs_2 <- diversity(cdm_2, index = "simpson")

D_1 <- 1 - Gs_1
D_2 <- 1 - Gs_2

print("Simpson's Evenness")
[1] "Simpson's Evenness"
1 / (D_1 * ncol(cdm_1))
[1] 1
1 / (D_2 * ncol(cdm_2))
[1] 0.5
# Calculate Shannon's diversity:

print("Shannon Index")
[1] "Shannon Index"
diversity(cdm_1, index = "shannon")
[1] 1.609438
diversity(cdm_2, index = "shannon")
[1] 1.227529

How would you interpret these results?

  • Plots 1 and 2 have the same richness and very similar Gini-Simpson Index values which is related to measuring richness. However, for Shannon Entropy and even more so for Simpson’s Evenness the values the values are very distinct. Without checking for evenness we might have missed how distinct these two communities are with plot 1 being very even and plot 2 being dominated by a single species (triangle sp.)

References

Hill, Mark O. 1973. “Diversity and Evenness: A Unifying Notation and Its Consequences.” Ecology 54 (2): 427–32.
Mittelbach, Gary G, and Brian J McGill. 2019. Community Ecology. Oxford University Press.
Oksanen, Jari, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O’Hara, et al. 2022. “Vegan: Community Ecology Package.” https://CRAN.R-project.org/package=vegan.
Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (3): 379–423.
Simpson, Edward H. 1949. “Measurement of Diversity.” Nature 163 (4148): 688–88.
Stevens, Martin Henry Hoffman. 2009. A Primer of Ecology with r. Springer.
Zeleny, D. 2020. “Analysis of Community Ecology Data in r. DavidZeleny. Net.” https://www.davidzeleny.net/anadat-r/doku.php/en:div-ind_r.