# Directly creating the long format dataframe
<- data.frame(
df plot_id = rep(c(1, 2, 3, 4, 5), each = 5),
species_id = rep(c("triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp"), times = 5),
count = c(2, 2, 2, 2, 2,
6, 1, 1, 1, 1,
4, 1, 2, 2, 2,
4, 2, 0, 4, 0,
0, 5, 3, 0, 2)
)
Biodiversity
There are many ways to measure biodiversity and methods to compare the biodiversity between populations or communities. The simplest two are often considered to be abundance and richness.
Richness
Do you remember the difference between richness and abundance?
- Richness is the number of species in a community while abundance is the number of individuals from each species in that community.
Do you remember the three ways we have talked about measuring diversity this semester?
- Alpha, beta, and gamma diversity
Gamma Diversity
It’s time to collect data along a transect! Create a dataframe for this dataset across the five vegetation zones which follow an elevation gradient:
head(df)
plot_id species_id count
1 1 triangle_sp 2
2 1 square_sp 2
3 1 cross_sp 2
4 1 circle_sp 2
5 1 x_sp 2
6 2 triangle_sp 6
What is the species richness of our region?
# There are various ways to accomplish this. Here are a couple examples of methods
<- length(unique(df$species_id))
gamma
gamma
[1] 5
<- df %>%
gamma select(species_id) %>%
n_distinct()
gamma
[1] 5
- 5
This total count of species richness across a region is referred to as \(\gamma\) diversity.
Alpha Diversity
Similarly, \(\alpha\) diversity is the richness of a local area or a single sample.
Calculate the \(\alpha\) diversity of plots 4 and 5
<- df %>%
alpha_4 filter(plot_id == 4) %>%
select(species_id) %>%
n_distinct()
<- df %>%
alpha_5 filter(plot_id == 5) %>%
select(species_id) %>%
n_distinct()
alpha_4
[1] 5
alpha_5
[1] 5
How similar are these plots? How distinct are they? Is this value or high or low for our region?
- \(\alpha\) diversity can start to answer these questions, but doesn’t provide too clear of an answer.
- Additionally, note that the species in plots 4 and 5 are extremely different, even though their richness is identical
Calculating \(\beta\) diversity can begin to help us explore these nuances.
Beta Diversity
\(\beta\) diversity can be defined as the number of species in a region not observed in sample or the turnover in species composition among sites. Therefore both \(\gamma\) (regional) and \(\alpha\) (site) diversity are directly related to \(\beta\) diversity.
\(\beta\) diversity is often calculated via additive partitioning.
\[ \beta \ =\ \gamma \ -\ \overline{\alpha } \]
Where \(\overline{\alpha }\) is the average \(\alpha\) diversity of the sample. Here, \(\beta\) diversity is the average number of species not found in a sample even though they are found in a region.
It is also commonly calculated using multiplicative partitioning:
\[ \beta \ =\ \frac{\gamma }{\overline{\alpha }} \]
This metric for turnover aids in determining how many unique assemblages exist in a region. It is similar to the original definition coined by Whittaker.
As with any step of model selection, you must be intentional about which version of \(\beta\) diversity is applicable to your question and sample.
We will calculate the beta diversity of sites 4 and 5. Once again, we will be using the vegan
package to help with this.
First, we will need to reformat the data into the format that vegan
prefers called a “Community data matrix,” with species names across the X-axis of the dataframe, sampling event down the Y-axis, and abundances where they meet. We will widen the dataframe by species name. In general data collection and data presentation should be done in a long format since it is more human-readable, whereas analysis should be done in the long format since it is more easily computer readable. To do so, we will use the tidyR
pivot_wider()
function.
<- df %>%
comm_data_matrix pivot_wider(names_from = species_id, values_from = count, values_fill = 0)
head(comm_data_matrix)
# A tibble: 5 × 6
plot_id triangle_sp square_sp cross_sp circle_sp x_sp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 2 2 2 2
2 2 6 1 1 1 1
3 3 4 1 2 2 2
4 4 4 2 0 4 0
5 5 0 5 3 0 2
To calculate \(\overline{\alpha }\), use the vegan
function specnumber()
:
<- mean(specnumber(df)) alpha_bar
This will need to be done after the dataframe has only species columns, because that is the format vegan
prefers.
With this information, calculate beta diversity for sites 1, 4, and 5.
# Wrangle data
<- comm_data_matrix %>%
df_1 filter(plot_id == 1) %>%
select(-plot_id) # Vegan only likes species columns
<- comm_data_matrix %>%
df_4 filter(plot_id == 4) %>%
select(-plot_id)
<- comm_data_matrix %>%
df_5 filter(plot_id == 5) %>%
select(-plot_id)
# Calculate alpha bar for each plot
<- mean(specnumber(df_1))
alpha_bar_1 <- mean(specnumber(df_4))
alpha_bar_4 <- mean(specnumber(df_5))
alpha_bar_5
# Calculate additive beta diversity for each plot
<- gamma - alpha_bar_1
beta_add_1 <- gamma - alpha_bar_4
beta_add_4 <- gamma - alpha_bar_5
beta_add_5
# Calculate multiplicative beta diversity for each plot
<- gamma / alpha_bar_1
beta_mult_1 <- gamma / alpha_bar_4
beta_mult_4 <- gamma / alpha_bar_5
beta_mult_5
# Output the results
paste("gamma:", gamma)
[1] "gamma: 5"
print("alpha_bar")
[1] "alpha_bar"
paste("alpha_bar_1:", alpha_bar_1)
[1] "alpha_bar_1: 5"
paste("alpha_bar_4:", alpha_bar_4)
[1] "alpha_bar_4: 3"
paste("alpha_bar_5:", alpha_bar_5)
[1] "alpha_bar_5: 3"
print("beta_add")
[1] "beta_add"
paste("beta_add_1:", beta_add_1)
[1] "beta_add_1: 0"
paste("beta_add_4:", beta_add_4)
[1] "beta_add_4: 2"
paste("beta_add_5:", beta_add_5)
[1] "beta_add_5: 2"
print("beta_mult_1")
[1] "beta_mult_1"
paste("beta_mult_1:", beta_mult_1)
[1] "beta_mult_1: 1"
paste("beta_mult_4:", beta_mult_4)
[1] "beta_mult_4: 1.66666666666667"
paste("beta_mult_5:", beta_mult_5)
[1] "beta_mult_5: 1.66666666666667"
What do these results imply?
These \(\beta\) diversity values for plot 1 are at the maximum, the richness of speices in this plot is represented of the whole region.
The \(\beta\) diversity values for plots 4 and 5 are equal, even though the actual species in these plots are very different
These \(\beta\) diversity values are useful because they are more readily comparable to other regions because they are normalized for the regional diversity (\(\gamma\) diversity)
Because \(\beta \ <\ \overline{\alpha }\), each plot contains more species from the region than it is lacking.
Abundance
Richness is the number of species in a community while abundance is the number of individuals from each species in that community.
How can we find abundance within the community we are studying?
- We need to count the number of individuals detected from each species.
Use dplyr to group our wide dataframe df
by species_id
and use summarize()
to create a column of total abundances called count
<- df %>%
abundances group_by(species_id) %>%
summarize(count = sum(count)) %>%
ungroup()
abundances
# A tibble: 5 × 2
species_id count
<chr> <dbl>
1 circle_sp 9
2 cross_sp 8
3 square_sp 11
4 triangle_sp 16
5 x_sp 7
Plot the abundances of our new dataframe abundances
as a barplot using ggplot2.
ggplot(data = abundances, aes(x = reorder(species_id, -count), y = count)) +
geom_col() +
labs(x = "Species", y = "Total Count Across Regions")
What do you notice about the abundances?
- The species “triangle_sp” is the dominant species, the others are more similar in proportion
Rank Abundance Curves
Rank abundance curves are a common way to visualize the distribution of species abundances in a community. They are a type of species accumulation curve, which are used to estimate the number of species in a community. They give you a quick way to compare how similar/different communities are just based on simple abundances.
<- df %>%
df_ranked group_by(plot_id, species_id) %>%
summarize(total_count = sum(count)) %>%
arrange(plot_id, desc(total_count)) %>%
mutate(rank = row_number()) %>%
ungroup()
`summarise()` has grouped output by 'plot_id'. You can override using the
`.groups` argument.
ggplot(df_ranked, aes(x = rank, y = total_count, color = factor(plot_id))) +
geom_point() +
geom_line() +
scale_x_continuous(trans = "log10", breaks = c(1, 10, 100), minor_breaks = NULL) +
scale_y_continuous(trans = "log10", breaks = c(1, 10, 100), minor_breaks = NULL) +
labs(x = "Abundance Rank (log scale)", y = "Abundance (log scale)", color = "Plot ID")
What do you notice?
Plot 2 is dominated by a single species. In the real world, many rank abundance curves take a similar shape with few dominant species and many rare species
Plot 1 has even abundance across species. We will explore evenness in depth later on.
Distances
Let’s plot for species “triangle_sp” vs. “circle_sp” for plots 1, 3, and 4 from our comm_data_matrix
.
<- comm_data_matrix %>%
df_tri_cir ::select(plot_id, triangle_sp, circle_sp) %>%
dplyrfilter(plot_id == 1 | plot_id == 3 | plot_id == 4)
ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
geom_text(aes(label = plot_id)) +
labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")
Can you tell by looking at this plot how different the relationship is between these species by site?
- Perhaps 1 and 4 are different from each other, but it is hard to tell if 4 or 1 is more similar to plot 3
Euclidean Distance
We can calculate the distance between plots to quantify how different they are. In general, distances in ecology are primarily used to quantify dissimilarity. The first method is calculating the Euclidean distance between points. We just look at this x, y coordinate plot as a map. Therefore, this distance can be calculated with
\[ \sqrt{( x_{1} \ -\ x_{2})^{2} \ +\ ( y_{1} \ -\ y_{2})^{2}} \]
Calculate the Euclidean distance between plots 1 and 3 for our df df_tri_cir
<- df_tri_cir$triangle_sp[1]
x1 <- df_tri_cir$triangle_sp[2]
x2 <- df_tri_cir$circle_sp[1]
y1 <- df_tri_cir$circle_sp[2]
y2
sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2
Repeat this for plots 3 and 4
<- df_tri_cir$triangle_sp[2]
x1 <- df_tri_cir$triangle_sp[3]
x2 <- df_tri_cir$circle_sp[2]
y1 <- df_tri_cir$circle_sp[3]
y2
sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2
What do these results imply?
- Plots 1 and 4 seem to be equally similar to plot 3
Bray-Curtis/Sørenson Distance
However, this Euclidean distance can quickly focus heavily on more abundant species over rare species. To prevent this by weighing abundant and rare species evenly, we can calculate the abundances between two sites divided by the total abundances at each site. This is called the Bray-Curtis or Sørenson Distance.
\[ \sum \frac{|x_{ai} \ -\ x_{bi} |}{x_{ai} \ +\ x_{bi}} \]
To help us do so, we will introduce you to the currently most popular community ecology package in R, vegan
(Oksanen et al. 2022).
The package vegan
frequently takes community data matrices as the input and require columns to only represent species, so you must select()
for only species columns before passing a dataframe to vegan
.
The vegdist()
function from the vegan
package provides Bray-Curtis distance by default for an entire dataset. Filter for plots 1 and 3, select for “triangle_sp” and “circle_sp”, and use the vegdist()
function with method = "euclidean"
to check our previous answer:
<- comm_data_matrix %>%
df_tri_cir_dcm ::select(plot_id, triangle_sp, circle_sp) %>%
dplyrfilter(plot_id == 1 | plot_id == 3) %>%
::select(triangle_sp, circle_sp)
dplyr
vegdist(df_tri_cir_dcm, method="euclidean")
1
2 2
Now repeat using the Bray-Curtis distance
vegdist(df_tri_cir_dcm, method="bray")
1
2 0.2
Repeat for plots 3 versus 4, what do you notice in comparison to Euclidean distance results?
<- comm_data_matrix %>%
df_tri_cir_dcm ::select(plot_id, triangle_sp, circle_sp) %>%
dplyrfilter(plot_id == 3 | plot_id == 4) %>%
::select(triangle_sp, circle_sp)
dplyr
vegdist(df_tri_cir_dcm, method="bray")
1
2 0.1428571
- Even though following Euclidean distance plots 1 and 4 were equally similar to plot 3, following Bray-Curtis distance plots 3 and 4 are more similar than plots 1 and 3
Non-metric Multidimensional Scaling
We have focused on pairs of 2 species, but our dataframe has more than 2 columns. Let’s plot 3 columns (species “triangle_sp”, “circle_sp”, and “square_sp”) for plots 1, 3, and 4.
<- comm_data_matrix %>%
df_tri_cir_sq ::select(plot_id, triangle_sp, circle_sp, square_sp) %>%
dplyrfilter(plot_id == 1 | plot_id == 3 | plot_id == 4)
# Convert to plotly 3D scatter plot
plot_ly(data = df_tri_cir_sq, x = ~triangle_sp, y = ~circle_sp, z = ~square_sp, type = 'scatter3d', mode = 'markers') %>%
layout(scene = list(xaxis = list(title = 'triangle_sp'),
yaxis = list(title = 'circle_sp'),
zaxis = list(title = 'square_sp'),
camera = list(
eye = list(x = 0, y = 0, z = 2.5),
up = list(x = 0, y = 1, z = 0)
)))
At first this plot looks identical to our plot above of “triangle_sp” vs “circle_sp,” however when we rotate we can see the third dimension. Although this is pretty cool, it is even harder to visually inspect how similar/different these plots are.
Now, how can we visualize 4 or more columns (species “triangle_sp”, “circle_sp”, “square_sp”, and “species_x”) to see their distance for plots 1, 3, and 4? How about 5 columns?
One technique that attempts to visualize an optimal arrangement is Non-metric Multidimensional Scaling (NMDS). Note that the axes of the plotted results are arbitrary (non-metric) but instead represent the relative weight of relationships.
NMDS is an example of an ordination technique, which allow us to arrange multidimensional data and reduce it to low-dimensional space based on their similarities or dissimilarities. This is very useful for visualizing and analyzing complex data.
All of these ordination methods do something more or less similar to the following visualization of a simple PCA, they look at multivariate point cloud from a certain angle and take a 2D slice through that plane:
Once again, vegan
can help with this, among other options it can produce NMDS plots by either Euclidean or Bray-Curtis distance.
<- comm_data_matrix %>%
df_tri_cir_sq_x ::select(plot_id, triangle_sp, circle_sp, square_sp, x_sp) %>%
dplyrfilter(plot_id == 1 | plot_id == 3 | plot_id == 4)
<- metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance = "euclidean") nmds_Euc
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.07171074 max resid 0.08708606
Run 2 stress 0
... Procrustes: rmse 0.03748834 max resid 0.04495218
Run 3 stress 0
... Procrustes: rmse 0.1470433 max resid 0.1858873
Run 4 stress 0
... Procrustes: rmse 0.2956769 max resid 0.4175908
Run 5 stress 0
... Procrustes: rmse 0.0490357 max resid 0.05851162
Run 6 stress 0
... Procrustes: rmse 0.1516245 max resid 0.2033164
Run 7 stress 0
... Procrustes: rmse 0.09362361 max resid 0.1028682
Run 8 stress 0
... Procrustes: rmse 0.02580521 max resid 0.02974746
Run 9 stress 0
... Procrustes: rmse 0.09593518 max resid 0.1029452
Run 10 stress 0
... Procrustes: rmse 0.2419987 max resid 0.3381087
Run 11 stress 0
... Procrustes: rmse 0.2578759 max resid 0.3545848
Run 12 stress 0
... Procrustes: rmse 0.2412986 max resid 0.2955211
Run 13 stress 0
... Procrustes: rmse 0.1268612 max resid 0.1581423
Run 14 stress 0
... Procrustes: rmse 0.05068433 max resid 0.05982502
Run 15 stress 0
... Procrustes: rmse 0.08900928 max resid 0.09679533
Run 16 stress 0
... Procrustes: rmse 0.08125188 max resid 0.08834711
Run 17 stress 0
... Procrustes: rmse 0.0829068 max resid 0.0880893
Run 18 stress 0
... Procrustes: rmse 0.03940576 max resid 0.04417649
Run 19 stress 0
... Procrustes: rmse 0.03521958 max resid 0.03982364
Run 20 stress 0
... Procrustes: rmse 0.2683186 max resid 0.3755661
*** Best solution was not repeated -- monoMDS stopping criteria:
20: stress < smin
Warning in metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance =
"euclidean"): stress is (nearly) zero: you may have insufficient data
<- as.data.frame(scores(nmds_Euc, display = "sites"))
scores_plot_df $plot_id <- df_tri_cir_sq_x$plot_id
scores_plot_df
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
geom_text(aes(label = plot_id)) +
labs(title = "NMDS Plot with Euclidean Distance", x = "NMDS_1", y = "NMDS_2")
# Repeat for Bray-Curtis distance
<- metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance = "bray") nmds_Bray
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.1026729 max resid 0.138582
Run 2 stress 0
... Procrustes: rmse 0.03238101 max resid 0.04118483
Run 3 stress 0
... Procrustes: rmse 0.2146753 max resid 0.3011667
Run 4 stress 0
... Procrustes: rmse 0.02246243 max resid 0.02763603
Run 5 stress 0
... Procrustes: rmse 0.1077626 max resid 0.1198168
Run 6 stress 0
... Procrustes: rmse 0.1215243 max resid 0.1657981
Run 7 stress 0
... Procrustes: rmse 0.100309 max resid 0.1130419
Run 8 stress 0
... Procrustes: rmse 0.1057652 max resid 0.1237793
Run 9 stress 0
... Procrustes: rmse 0.1455311 max resid 0.16061
Run 10 stress 0
... Procrustes: rmse 0.1924084 max resid 0.2401998
Run 11 stress 0
... Procrustes: rmse 0.07820513 max resid 0.09101498
Run 12 stress 0
... Procrustes: rmse 0.09694032 max resid 0.1122681
Run 13 stress 0
... Procrustes: rmse 0.1898867 max resid 0.2449223
Run 14 stress 0
... Procrustes: rmse 0.1942403 max resid 0.2607276
Run 15 stress 0
... Procrustes: rmse 0.1619204 max resid 0.197488
Run 16 stress 0
... Procrustes: rmse 0.08279267 max resid 0.09962706
Run 17 stress 0
... Procrustes: rmse 0.1924818 max resid 0.2684008
Run 18 stress 0
... Procrustes: rmse 0.1587719 max resid 0.2199899
Run 19 stress 0
... Procrustes: rmse 0.1780853 max resid 0.1856434
Run 20 stress 0
... Procrustes: rmse 0.2100083 max resid 0.2605126
*** Best solution was not repeated -- monoMDS stopping criteria:
20: stress < smin
Warning in metaMDS(df_tri_cir_sq_x %>% select(-plot_id), distance = "bray"):
stress is (nearly) zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
<- as.data.frame(scores(nmds_Bray, display = "sites"))
scores_plot_df $plot_id <- df_tri_cir_sq_x$plot_id
scores_plot_df
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
geom_text(aes(label = plot_id)) +
labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2")
Compare this to our original plot, are there any differences that you notice?
<- comm_data_matrix %>%
df_tri_cir ::select(plot_id, triangle_sp, circle_sp) %>%
dplyrfilter(plot_id == 1 | plot_id == 3 | plot_id == 4)
ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
geom_text(aes(label = plot_id)) +
labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")
NMDS plots can go well beyond 4-dimensions, because of this expandability, we can plot distances for all species in our sample simultaneously to identify which plots may be more different from one another as far as species abundances is concerned.
# Repeat creating NMDS using Bray-Curtis distance for all species
<- metaMDS(comm_data_matrix %>% select(-plot_id), distance = "bray") nmds_Bray
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.2030446 max resid 0.2747318
Run 2 stress 0
... Procrustes: rmse 0.07315567 max resid 0.1075529
Run 3 stress 5.826605e-05
... Procrustes: rmse 0.1376449 max resid 0.1645601
Run 4 stress 0
... Procrustes: rmse 0.1687072 max resid 0.2028889
Run 5 stress 0
... Procrustes: rmse 0.1949457 max resid 0.2825307
Run 6 stress 6.16511e-05
... Procrustes: rmse 0.1878843 max resid 0.3295183
Run 7 stress 0
... Procrustes: rmse 0.1573602 max resid 0.2940581
Run 8 stress 0
... Procrustes: rmse 0.08422943 max resid 0.1286996
Run 9 stress 0
... Procrustes: rmse 0.1081722 max resid 0.1791993
Run 10 stress 9.905905e-05
... Procrustes: rmse 0.2062697 max resid 0.3621891
Run 11 stress 0
... Procrustes: rmse 0.07446945 max resid 0.09334791
Run 12 stress 0
... Procrustes: rmse 0.07404918 max resid 0.09043398
Run 13 stress 7.930351e-05
... Procrustes: rmse 0.1300944 max resid 0.2363241
Run 14 stress 5.163153e-05
... Procrustes: rmse 0.1899391 max resid 0.2320315
Run 15 stress 0
... Procrustes: rmse 0.1326119 max resid 0.2120979
Run 16 stress 0
... Procrustes: rmse 0.08030282 max resid 0.1132989
Run 17 stress 0
... Procrustes: rmse 0.05150546 max resid 0.0877158
Run 18 stress 9.904681e-05
... Procrustes: rmse 0.2449894 max resid 0.326838
Run 19 stress 0
... Procrustes: rmse 0.09960809 max resid 0.1500808
Run 20 stress 0
... Procrustes: rmse 0.1434484 max resid 0.2017473
*** Best solution was not repeated -- monoMDS stopping criteria:
20: stress < smin
Warning in metaMDS(comm_data_matrix %>% select(-plot_id), distance = "bray"):
stress is (nearly) zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
<- as.data.frame(scores(nmds_Bray, display = "sites"))
scores_plot_df $plot_id <- comm_data_matrix$plot_id
scores_plot_df
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
geom_text(aes(label = plot_id)) +
labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2")
What hypotheses can we make based on this plot?
- We can make various hypotheses, such as that plots 1 and 3 are more similar to one another than to the other plots regrading overall species abundances
Again, NMDS is an example of an ordination technique, other examples include:
- Principal Component Analysis (PCA) is primarily used for uncovering patterns in data by reducing its dimensions linearly, without considering external variables.
- Canonical Correspondence Analysis (CCA) to explore the relationships between two sets of variables, particularly when one set is hypothesized to influence the other.
- Principal Coordinates Analysis (PCoA) focuses on preserving the pairwise distances between observations when data is reduced to a lower-dimensional space.
Diversity Indices
Let’s bring our understanding together to quantify diversity using indices. Diversity is challenging to quantify because it is difficult to agree on a definition. For example, we could attempt to quantify biodiversity with any of the following:
- Probability of encountering a species at random
- Uncertainty of possible community states
- Variance in species composition between sites
Species Richness Index
The most common biodiversity index is the Species Richness Index, counting the number of unique species in a community. We already completed this task above, calling it \(\gamma\) diversity for our system:
<- df %>%
species_richness select(species_id) %>%
n_distinct()
species_richness
[1] 5
<- ncol(comm_data_matrix) - 1
species_richness
species_richness
[1] 5
However, it is highly sensitive to variations in sampling effort because it treats all species equally, weighing rare and common species equally despite the greater likelihood of rare species going undetected.
It does not account for evenness: the fact that some species in the community are common and others are rare.
What do you notice about the examples above?
- They have equal species richness, however the communities are very different based on evenness.
Shannon and Simpson Diversity Indices
The Shannon(Shannon 1948) and Simpson Diversity Indices(Simpson 1949), both special cases of the Hill’s Number(Hill 1973), are very popular biodiversity indices because they focus on both richness and evenness.
\[ \begin{array}{l} Shannon-Wiener\ Diversity\ =-\sum _{i\ =\ i}^{S} p_{i}\ln p_{i}\\ \\ Simpson\ Diversity\ =\frac{1}{\sum _{i\ =\ 1}^{S} \ p_{i}^{2}}\\ \\ p_{i} \ =\ relative\ abundance\ of\ species\ i\\ S\ =\ total\ number\ of\ species \end{array} \]
You will likely see the Shannon Index referred to in different ways throughout the literature because of articles/books that were published on the topic by three different mathematicians over the course of a couple of years. Here is a quick guide to the alternative names you will encounter:
As richness and evenness increase, these indices increase. It is important to remember that Simpson’s Diversity Index emphasizes common species, while the Shannon Diversity Index is biased towards rare species. Here is how you can interpret these results:
Shannon’s Index interpretation:
- Shannon’s Index is based in information theory and represents the uncertainty that we will be able to predict which species will be randomly drawn from a community.
- Values of \(H\) (Shannon Entropy) are usually between 1.5-3.5 (the units are bits of information)
- The maximum value of \(H\) represents a perfectly even community
Simpson’s Index interpretation:
- Simpson’s Index represents the probability that two randomly selected individuals are from the same species.
- Simpson’s Index decreases with increasing richness, because of this Gini-Simpson index (\(1 - D\)) is often used to make value more comparable with other indices.
- The values of \(D\) range from 0 to 1
Because Simpson’s/Gini-Simpson’s indices refer more directly to richness than evenness, we can also calculate Simpson’s Evenness (“equitability”) where \(S\) is the number of species:
\[ \frac{1}{D S} \]
Now let’s calculate these values for our dataset. Thanks once again to the vegan
package, these equations are readily available through the function diversity()
. Here is the ?vegan::diversity()
help page:
Here is another nice summary as well explaining what exactly the diversity()
function is actually calculating:
Wrangle our dataframe comm_data_matrix
as needed, then compare the richness, Gini-Simpson Index, Simpson’s Evenness, and Shannon Index values for plots 1 and 2.
# Wrangle data
<- comm_data_matrix %>%
cdm_1 filter(plot_id == 1) %>%
select(-plot_id)
<- comm_data_matrix %>%
cdm_2 filter(plot_id == 2) %>%
select(-plot_id)
# Calculate richness
print("Richness")
[1] "Richness"
sum(cdm_1 > 0)
[1] 5
sum(cdm_2 > 0)
[1] 5
# Calculate Gini-Simpson's diversity
print("Gini-Simpson")
[1] "Gini-Simpson"
diversity(cdm_1, index = "simpson")
[1] 0.8
diversity(cdm_2, index = "simpson")
[1] 0.6
# Calculate Simpson's Evenness
<- diversity(cdm_1, index = "simpson")
Gs_1 <- diversity(cdm_2, index = "simpson")
Gs_2
<- 1 - Gs_1
D_1 <- 1 - Gs_2
D_2
print("Simpson's Evenness")
[1] "Simpson's Evenness"
1 / (D_1 * ncol(cdm_1))
[1] 1
1 / (D_2 * ncol(cdm_2))
[1] 0.5
# Calculate Shannon's diversity:
print("Shannon Index")
[1] "Shannon Index"
diversity(cdm_1, index = "shannon")
[1] 1.609438
diversity(cdm_2, index = "shannon")
[1] 1.227529
How would you interpret these results?
- Plots 1 and 2 have the same richness and very similar Gini-Simpson Index values which is related to measuring richness. However, for Shannon Entropy and even more so for Simpson’s Evenness the values the values are very distinct. Without checking for evenness we might have missed how distinct these two communities are with plot 1 being very even and plot 2 being dominated by a single species (triangle sp.)