# Directly creating the long format dataframe
<- data.frame(
df
plot_id = c(1, 1, 1, 1, 1,
2, 2, 2, 2, 2,
3, 3, 3, 3, 3,
4, 4, 4, 4, 4,
5, 5, 5, 5, 5),
species_id = c("triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
"triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
"triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
"triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp",
"triangle_sp", "square_sp", "cross_sp", "circle_sp", "x_sp"),
count = c(2, 2, 2, 2, 2,
6, 1, 1, 1, 1,
4, 1, 2, 2, 2,
4, 2, 0, 4, 0,
0, 5, 3, 0, 2)
)
Biodiversity
There are many ways to measure biodiversity and to compare these metrics between populations or communities. The simplest two are often considered to be richness and abundance.
Richness
Do you remember the difference between richness and abundance?
- Richness is the number of species in a community while abundance is the number of individuals from each species in that community.
Do you remember the three ways we have talked about measuring diversity this semester?
- Alpha, beta, and gamma diversity
Gamma Diversity
It’s time to collect data along a transect! Create a dataframe for this dataset across the five vegetation zones which follow an elevation gradient:
head(df)
plot_id species_id count
1 1 triangle_sp 2
2 1 square_sp 2
3 1 cross_sp 2
4 1 circle_sp 2
5 1 x_sp 2
6 2 triangle_sp 6
What is the species richness of our region?
# There are various ways to accomplish this. Here are a couple examples of methods
<- length(unique(df$species_id))
gamma
gamma
[1] 5
<- df %>%
gamma select(species_id) %>%
n_distinct()
gamma
[1] 5
This total count of species richness across a region is referred to as \(\gamma\) diversity.
Alpha Diversity
Similarly, \(\alpha\) diversity is the richness of a local area or a single sample.
Calculate the \(\alpha\) diversity of plots 1, 2, 4, and 5 using a dplyr
pipe
<- df %>%
alpha_1 filter(plot_id == 1) %>%
filter(count > 0) %>%
select(species_id) %>%
n_distinct()
<- df %>%
alpha_2 filter(plot_id == 2) %>%
filter(count > 0) %>%
select(species_id) %>%
n_distinct()
<- df %>%
alpha_4 filter(plot_id == 4) %>%
filter(count > 0) %>%
select(species_id) %>%
n_distinct()
<- df %>%
alpha_5 filter(plot_id == 5) %>%
filter(count > 0) %>%
select(species_id) %>%
n_distinct()
alpha_1
[1] 5
alpha_2
[1] 5
alpha_4
[1] 3
alpha_5
[1] 3
How similar are these plots? How distinct are they? Is this value or high or low for our region?
- \(\alpha\) diversity can start to answer these questions, but doesn’t provide too clear of an answer.
- Additionally, note that the species in plots 4 and 5 are extremely different, even though their richness (alpha diversity) is identical
Calculating \(\beta\) diversity can begin to help us explore these nuances.
Beta Diversity
\(\beta\) diversity has two distinct definitions. It is important to understand which definition someone is using when referring to \(\beta\) diversity. These definitions are:
The turnover in species composition among sites across a region or
The number of species in a region not observed in a given sample
Therefore both \(\alpha\) (site) and \(\gamma\) (regional) diversity are directly related to \(\beta\) diversity.
\(\beta\) diversity serves as a crucial bridge connecting local biodiversity (\(\alpha\) diversity) with the wider regional species pool (\(\gamma\) diversity). Examining beta diversity is central to community ecology, as it explores the factors that cause species assemblages to vary in similarity across various locations and times.
It is a very popular metric whose nuances are still being explored and perfected for specific scenarios:
Classical Beta Diversity Metrics
In its simplest and original form, turnover was calculated via multiplicative partitioning:
\[ \beta \ =\ \frac{\gamma }{\overline{\alpha }} \]
Where \(\overline{\alpha }\) is the average \(\alpha\) diversity of the sample. This formula is similar to the original definition coined by Whittaker(Whittaker 1960).
The second traditional way \(\beta\) diversity was calculated was via additive partitioning.
\[ \beta \ =\ \gamma \ -\ \overline{\alpha } \]
Here, \(\beta\) diversity is the average number of species not found in a sample, even though they are found in that region.
We will calculate the \(\beta\) diversity of sites 4 and 5. To help us do so, we will introduce you to the currently most popular community ecology package in R, vegan
(Oksanen et al. 2022).
First, we will need to reformat the data into the format that vegan
prefers called a “community data matrix,” with only species names as column names, sampling events down the Y-axis, and abundances where they meet.
For example, imagine you have a day of sampling in a single plot in the field. You would tally the amount of each species you see that day in that same row.
Inspect our dataframe df
df
plot_id species_id count
1 1 triangle_sp 2
2 1 square_sp 2
3 1 cross_sp 2
4 1 circle_sp 2
5 1 x_sp 2
6 2 triangle_sp 6
7 2 square_sp 1
8 2 cross_sp 1
9 2 circle_sp 1
10 2 x_sp 1
11 3 triangle_sp 4
12 3 square_sp 1
13 3 cross_sp 2
14 3 circle_sp 2
15 3 x_sp 2
16 4 triangle_sp 4
17 4 square_sp 2
18 4 cross_sp 0
19 4 circle_sp 4
20 4 x_sp 0
21 5 triangle_sp 0
22 5 square_sp 5
23 5 cross_sp 3
24 5 circle_sp 0
25 5 x_sp 2
Are there any changes that need to be made in order for this dataframe to be in community data matrix format?
Extra columns (plot_id, species_id) need to be removed. Community data matrices can only include species names as column names and abundances
The matrix needs to be pivoted. The species names are entries instead of column names.
The table is in long format so we will need to widen the dataframe by species name. In general data collection and data presentation should be done in a long format since it is more human-readable, whereas analysis should be done in the long format since it is more easily computer readable.
To do so, we will use the tidyR
(Wickham and Wickham 2017) pivot_wider()
function.
<- df %>%
comm_data_matrix pivot_wider(names_from = species_id, values_from = count, values_fill = 0)
head(comm_data_matrix)
# A tibble: 5 × 6
plot_id triangle_sp square_sp cross_sp circle_sp x_sp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 2 2 2 2
2 2 6 1 1 1 1
3 3 4 1 2 2 2
4 4 4 2 0 4 0
5 5 0 5 3 0 2
Remember, community data matrices require columns to only represent species, so you must select()
for only species columns before passing a dataframe to vegan
.
<- comm_data_matrix %>%
comm_data_matrix select(-plot_id)
comm_data_matrix
# A tibble: 5 × 5
triangle_sp square_sp cross_sp circle_sp x_sp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 2 2
2 6 1 1 1 1
3 4 1 2 2 2
4 4 2 0 4 0
5 0 5 3 0 2
With this information, create separate community data matrices (df_1
, df_2
,df_4
, df_5
) for sites 1, 2, 4, and 5 respectively using filter(row_number() == <row number>)
<- comm_data_matrix %>%
df_1 filter(row_number() == 1)
<- comm_data_matrix %>%
df_2 filter(row_number() == 2)
<- comm_data_matrix %>%
df_4 filter(row_number() == 4)
<- comm_data_matrix %>%
df_5 filter(row_number() == 5)
To calculate \(\overline{\alpha }\), use the vegan
function specnumber()
:
<- mean(specnumber(df)) alpha_bar
Noting that specnumber(df)
is a much shorter equivalent to what we did above:
<- df %>%
alpha filter(plot_id == <plot number>) %>%
filter(count > 0) %>%
select(<species column>) %>%
n_distinct()
Use this to create the objects alpha_bar_12
and alpha_bar_45
<- mean(specnumber(df_1), specnumber(df_2))
alpha_bar_12 <- mean(specnumber(df_4), specnumber(df_5)) alpha_bar_45
Use these objects and the gamma diversity we calculated above to calculate multiplicative and additive beta diversity for each plot:
# Calculate multiplicative beta diversity for each plot
<- gamma / alpha_bar_12
beta_mult_12 <- gamma / alpha_bar_45 beta_mult_45
# Calculate additive beta diversity for each plot
<- gamma - alpha_bar_12
beta_add_12 <- gamma - alpha_bar_45 beta_add_45
Metric Value
1 gamma 5.000000
2 alpha_1 5.000000
3 alpha_2 5.000000
4 alpha_4 3.000000
5 alpha_5 3.000000
6 alpha_bar_12 5.000000
7 alpha_bar_45 3.000000
8 beta_mult_12 1.000000
9 beta_mult_45 1.666667
10 beta_add_12 0.000000
11 beta_add_45 2.000000
What do these results imply?
These \(\beta\) diversity values are useful because they are more readily comparable to other regions because they are normalized for the regional diversity (\(\gamma\) diversity)
The \(\beta\) diversity values for plots 1 and 2 are at the maximum ( \(\overline{\alpha}\) = \(\beta\) ), the richness of species in these plots is representative of the whole region. The multiplicative and additive values of 0 and 1 respectively imply no species turnover between these plots.
The multiplicative and additive \(\beta\) diversity values for plots 4 and 5 is higher than for plots 1 and 2, therefore these plots have more species turnover. High values of \(\beta\) diversity imply these are diverse, non-redundant plots that together build the regional diversity.
Jaccard and Sørenson Indices to Calculate Beta Diversity
For more exact estimates of \(\beta\) diversity, we can calculate the Jaccard and Sørenson distances. These distances are both measures of similarity between two sites, with the Jaccard index being a measure of similarity and the Sørenson index being a measure of dissimilarity. They also both only consider presence/absence, so we are still only looking at species richness.
Let’s start with Jaccard. Sites that are most similar are closer to 0 and those that are most different are closer to 1.
# Calculate Jaccard distance
<- vegdist(comm_data_matrix, method = "jaccard")
jaccard_dist
jaccard_dist
1 2 3 4
2 0.5714286
3 0.2500000 0.3846154
4 0.5714286 0.5714286 0.5000000
5 0.5714286 0.8235294 0.6875000 0.8888889
Which plots are are most similar and which are different? Does this match your intuition?
Now let’s do Sørenson which ranges from 0 (identical) to 1 (completely dissimilar).
# Calculate Sørensen distance
# Sørensen is equivalent to "bray" method with binary data in vegdist
<- vegdist(comm_data_matrix, method = "bray", binary = TRUE)
sorensen_dist
sorensen_dist
1 2 3 4
2 0.0000000
3 0.0000000 0.0000000
4 0.2500000 0.2500000 0.2500000
5 0.2500000 0.2500000 0.2500000 0.6666667
Which plots are are most similar and which are different? Does this match your intuition?
Other Methods to Calculate Beta Diversity
You must be intentional about which version of \(\beta\) diversity is applicable to your question and system. Many options are available and continue to be developed/popularized for calculating \(\beta\) diversity for different scenarios. Based on our dataset and this flowchart, which method(s) might be the best for calculating \(\beta\) diversity in our case?
- We have not yet considered abundances. However, since we have abundance data we can integrate it into our calculation of \(\beta\) diversity to more accurately differentiate between communities.
- It is likely that we want to exclude joint absences. With the low number of species in our survey, there are many more species in the world that our system is lacking than it includes. Therefore, what species are in our plot gives us more information than what is not in our plot.
- Choosing our exact method for calculation will depend on our research question. Are we most interested in rare species or the proportion of species? Are we more interested in relative abundance or log abundance (for example, is there a large difference between abundances)?
We will explore abundance metrics in the following section
Abundance
Remember, richness is the number of species in a community while abundance is the number of individuals from each species in that community.
Up until this point, we have only been using richness in our metrics, however our sample includes abundance values.
Use dplyr
to group our long dataframe df
by species_id
and use mutate()
to create a column of total abundances called total_count
.
<- df %>%
df group_by(species_id) %>%
mutate(total_count = sum(count)) %>%
ungroup()
df
# A tibble: 25 × 4
plot_id species_id count total_count
<dbl> <chr> <dbl> <dbl>
1 1 triangle_sp 2 16
2 1 square_sp 2 11
3 1 cross_sp 2 8
4 1 circle_sp 2 9
5 1 x_sp 2 7
6 2 triangle_sp 6 16
7 2 square_sp 1 11
8 2 cross_sp 1 8
9 2 circle_sp 1 9
10 2 x_sp 1 7
# ℹ 15 more rows
Plot the total abundances of our df
as a barplot (geom_col()
) using ggplot2(Wickham 2016).
ggplot(data = df, aes(x = species_id, y = total_count)) +
geom_col() +
labs(x = "Species", y = "Total Count Across Regions")
What do you notice about the abundances?
- The species “triangle_sp” is the dominant species, the others are more similar in proportion
Rank Abundance Curves
Rank abundance curves are a common way to visualize the distribution of species abundances in a community. They are a type of species accumulation curve, which are used to estimate the number of species in a community. They give you a quick way to compare how similar/different communities are just based on simple abundances.
# Rank abundances of each species within each plot
<- df %>%
df group_by(plot_id, species_id) %>%
summarize(plot_count = sum(count)) %>%
arrange(plot_id, desc(plot_count)) %>%
mutate(rank = row_number()) %>%
ungroup()
`summarise()` has grouped output by 'plot_id'. You can override using the
`.groups` argument.
ggplot(df, aes(x = rank, y = plot_count, color = factor(plot_id))) +
geom_point() +
geom_line() +
labs(x = "Abundance Rank (log scale)", y = "Abundance (log scale)", color = "Plot ID")
What do you notice?
Plot 2 is dominated by a single species. In the real world, many rank abundance curves take a similar shape with few dominant species and many rare species
Plot 1 has even abundance across species. We will explore evenness in depth later on.
Distances
For simple distances to compare plots, we need to select only 2 species compositions to compare between plots because basic plots are 2-dimensional.
Let’s plot for species “triangle_sp” vs. “circle_sp” for plots 1, 3, and 4 from our comm_data_matrix
.
<- comm_data_matrix %>%
df_tri_cir select(triangle_sp, circle_sp) %>%
mutate(plot_id = row_number()) %>%
filter(plot_id == 1 | plot_id == 3 | plot_id == 4)
ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
geom_text(aes(label = plot_id)) +
labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")
Can you tell by looking at this plot how different the relationship is between these species by site?
- Perhaps 1 and 4 are different from each other, but it is hard to tell if 4 or 1 is more similar to plot 3
Euclidean Distance
We can calculate the distance between plots to quantify how different they are. In general, distances in ecology are primarily used to quantify dissimilarity. The first method is calculating the Euclidean distance between points. We just look at this x, y coordinate plot as a map. Therefore, this distance can be calculated with
\[ \sqrt{( x_{1} \ -\ x_{2})^{2} \ +\ ( y_{1} \ -\ y_{2})^{2}} \]
Calculate the Euclidean distance between plots 1 and 3 for our df df_tri_cir
<- df_tri_cir$triangle_sp[1]
x1 <- df_tri_cir$triangle_sp[2]
x2 <- df_tri_cir$circle_sp[1]
y1 <- df_tri_cir$circle_sp[2] y2
# Calculate Euclidean distance
sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2
Repeat this for plots 3 and 4
<- df_tri_cir$triangle_sp[2]
x1 <- df_tri_cir$triangle_sp[3]
x2 <- df_tri_cir$circle_sp[2]
y1 <- df_tri_cir$circle_sp[3] y2
sqrt((x1 - x2)^2 + (y1 - y2)^2)
[1] 2
What do these results imply?
- Plots 1 and 4 seem to be equally similar to plot 3
Bray-Curtis Distance
However, this Euclidean distance can quickly focus too heavily on more abundant species over rare species. To prevent this by weighing abundant and rare species evenly, we can calculate the abundances between two sites divided by the total abundances at each site. This is called the Bray-Curtis distance.
\[ \sum \frac{|x_{ai} \ -\ x_{bi} |}{x_{ai} \ +\ x_{bi}} \]
The vegdist()
function from the vegan
package provides Bray-Curtis distance by default for an entire dataset. Filter for plots 1 and 3, select for “triangle_sp” and “circle_sp”, and use the vegdist()
function with method = "euclidean"
to check our previous answer. The slice()
function from dplyr
is a straightforward way to filter for rows based on row number.
<- comm_data_matrix %>%
df_tri_cir select(triangle_sp, circle_sp) %>%
slice(1, 3)
vegdist(df_tri_cir, method="euclidean")
1
2 2
This is the same answer we got above.
With vegan
we can complete these calculations easily for our whole dataset:
vegdist(comm_data_matrix, method="euclidean")
1 2 3 4
2 4.472136
3 2.236068 2.645751
4 4.000000 4.000000 3.605551
5 4.242641 7.615773 6.082763 7.348469
Now repeat using the Bray-Curtis distance
vegdist(comm_data_matrix, method="bray")
1 2 3 4
2 0.4000000
3 0.1428571 0.2380952
4 0.4000000 0.4000000 0.3333333
5 0.4000000 0.7000000 0.5238095 0.8000000
What do you notice in comparison to Euclidean distance results?
- Even though following Euclidean distance plots 1 and 4 were equally similar to plot 3, following Bray-Curtis distance plots 3 and 4 are more similar than plots 1 and 3
Non-metric Multidimensional Scaling
We have focused on pairs of 2 species, but our dataframe has more than 2 columns. Let’s plot 3 columns (species “triangle_sp”, “circle_sp”, and “square_sp”) for plots 1, 3, and 4. Wrangle data to create df_tri_cir_sq
object.
<- comm_data_matrix %>%
df_tri_cir_sq select(triangle_sp, circle_sp, square_sp) %>%
slice(1, 3, 4)
Now that we have 3 species, we can plot them in 3 dimensions.
# Convert to plotly 3D scatter plot
plot_ly(data = df_tri_cir_sq, x = ~triangle_sp, y = ~circle_sp, z = ~square_sp, type = 'scatter3d', mode = 'markers') %>%
layout(scene = list(xaxis = list(title = 'triangle_sp'),
yaxis = list(title = 'circle_sp'),
zaxis = list(title = 'square_sp'),
camera = list(
eye = list(x = 0, y = 0, z = 2.5),
up = list(x = 0, y = 1, z = 0)
)))
At first this plot looks identical to our plot above of “triangle_sp” vs “circle_sp,” however when we rotate we can see the third dimension. Although this looks pretty cool, it is even harder to visually inspect how similar/different these plots are.
Now, how can we visualize 4 or more columns (species “triangle_sp”, “circle_sp”, “square_sp”, and “species_x”) to see their distance for plots 1, 3, and 4? How about 5 columns?
One technique that attempts to visualize an optimal arrangement is Non-metric Multidimensional Scaling (NMDS). Note that the axes of the plotted results are arbitrary (non-metric) but instead represent the relative weight of relationships.
NMDS is an example of an ordination technique, which allow us to arrange multidimensional data and reduce it to low-dimensional space based on their similarities or dissimilarities. This is very useful for visualizing and analyzing complex data.
All of these ordination methods do something more or less similar to the following visualization of a simple PCA, they look at multivariate point cloud from a certain angle and take a 2D slice through that plane:
Once again, vegan
can help with this, among other options it can produce NMDS plots by either Euclidean or Bray-Curtis distance.
# Wrangle
<- comm_data_matrix %>%
df_tri_cir_sq_x select(triangle_sp, circle_sp, square_sp, x_sp) %>%
slice(1, 3, 4)
<- metaMDS(df_tri_cir_sq_x, distance = "euclidean") nmds_Euc
Warning in metaMDS(df_tri_cir_sq_x, distance = "euclidean"): stress is (nearly)
zero: you may have insufficient data
<- as.data.frame(scores(nmds_Euc, display = "sites"))
scores_plot_df
<- scores_plot_df %>%
scores_plot_df mutate(plot_id = c(1, 3, 4))
# Plot
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
geom_text(aes(label = plot_id)) +
labs(title = "NMDS Plot with Euclidean Distance", x = "NMDS_1", y = "NMDS_2")
Repeat for Bray-Curtis distance
# Repeat for Bray-Curtis distance
<- comm_data_matrix %>%
df_tri_cir_sq_x select(triangle_sp, circle_sp, square_sp, x_sp) %>%
slice(1, 3, 4)
<- metaMDS(df_tri_cir_sq_x, distance = "bray") nmds_Bray
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.0685145 max resid 0.08513997
Run 2 stress 0
... Procrustes: rmse 0.0673008 max resid 0.08091399
Run 3 stress 0
... Procrustes: rmse 0.2473814 max resid 0.349626
Run 4 stress 0
... Procrustes: rmse 0.06474159 max resid 0.08166007
Run 5 stress 0
... Procrustes: rmse 0.07408706 max resid 0.08920374
Run 6 stress 0
... Procrustes: rmse 0.09968261 max resid 0.1344664
Run 7 stress 0
... Procrustes: rmse 0.06882658 max resid 0.0862992
Run 8 stress 0
... Procrustes: rmse 0.1826825 max resid 0.1879906
Run 9 stress 0
... Procrustes: rmse 0.1648719 max resid 0.2234559
Run 10 stress 0
... Procrustes: rmse 0.1335361 max resid 0.1461431
Run 11 stress 0
... Procrustes: rmse 0.1693015 max resid 0.2102284
Run 12 stress 0
... Procrustes: rmse 0.2507016 max resid 0.3082039
Run 13 stress 0
... Procrustes: rmse 0.1698787 max resid 0.1774413
Run 14 stress 0
... Procrustes: rmse 0.09633324 max resid 0.1227606
Run 15 stress 0
... Procrustes: rmse 0.08850214 max resid 0.1010874
Run 16 stress 0
... Procrustes: rmse 0.08261412 max resid 0.1101298
Run 17 stress 0
... Procrustes: rmse 0.1101204 max resid 0.148136
Run 18 stress 2.085682e-05
... Procrustes: rmse 0.1080145 max resid 0.1292131
Run 19 stress 0
... Procrustes: rmse 0.08904901 max resid 0.1191981
Run 20 stress 0
... Procrustes: rmse 0.161462 max resid 0.2214368
*** Best solution was not repeated -- monoMDS stopping criteria:
20: stress < smin
Warning in metaMDS(df_tri_cir_sq_x, distance = "bray"): stress is (nearly)
zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
<- as.data.frame(scores(nmds_Euc, display = "sites"))
scores_plot_df
<- scores_plot_df %>%
scores_plot_df mutate(plot_id = c(1, 3, 4))
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
geom_text(aes(label = plot_id)) +
labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2")
Compare this to our original 2D plot where we used triangle and circle species to try to compare plots, are there any differences that you notice?
<- comm_data_matrix %>%
df_tri_cir select(triangle_sp, circle_sp) %>%
mutate(plot_id = row_number()) %>%
filter(plot_id == 1 | plot_id == 3 | plot_id == 4)
ggplot(data = df_tri_cir, aes(x = triangle_sp, y = circle_sp)) +
geom_text(aes(label = plot_id)) +
labs(x = "triangle_sp", y = "circle_sp", title = "Abundances of triangle_sp vs. circle_sp")
NMDS plots can go well beyond 4-dimensions. Because of this expandability, we can plot distances for all species in our sample simultaneously to identify which plots may be more different from one another as far as species abundances is concerned.
# Repeat creating NMDS using Bray-Curtis distance for all species
<- metaMDS(comm_data_matrix, distance = "bray") nmds_Bray
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.1041114 max resid 0.1325716
Run 2 stress 0
... Procrustes: rmse 0.1489232 max resid 0.2743453
Run 3 stress 8.912132e-05
... Procrustes: rmse 0.2079909 max resid 0.3658446
Run 4 stress 0
... Procrustes: rmse 0.1099848 max resid 0.1856015
Run 5 stress 0
... Procrustes: rmse 0.09470483 max resid 0.1642575
Run 6 stress 5.2717e-05
... Procrustes: rmse 0.2132488 max resid 0.2750503
Run 7 stress 0
... Procrustes: rmse 0.1351432 max resid 0.1790358
Run 8 stress 0
... Procrustes: rmse 0.08050803 max resid 0.1121345
Run 9 stress 0.0003124553
... Procrustes: rmse 0.2211435 max resid 0.3031416
Run 10 stress 0
... Procrustes: rmse 0.08620733 max resid 0.1599822
Run 11 stress 8.340072e-05
... Procrustes: rmse 0.223066 max resid 0.291417
Run 12 stress 0
... Procrustes: rmse 0.06181389 max resid 0.0904574
Run 13 stress 3.03667e-05
... Procrustes: rmse 0.2417417 max resid 0.3402684
Run 14 stress 0
... Procrustes: rmse 0.1491092 max resid 0.1700833
Run 15 stress 0
... Procrustes: rmse 0.2008313 max resid 0.279812
Run 16 stress 8.998705e-05
... Procrustes: rmse 0.2576293 max resid 0.3560075
Run 17 stress 0
... Procrustes: rmse 0.1989037 max resid 0.2812514
Run 18 stress 0
... Procrustes: rmse 0.1637751 max resid 0.2166281
Run 19 stress 0
... Procrustes: rmse 0.1213147 max resid 0.2018581
Run 20 stress 0
... Procrustes: rmse 0.05847153 max resid 0.1008559
*** Best solution was not repeated -- monoMDS stopping criteria:
1: no. of iterations >= maxit
19: stress < smin
Warning in metaMDS(comm_data_matrix, distance = "bray"): stress is (nearly)
zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
<- as.data.frame(scores(nmds_Bray, display = "sites"))
scores_plot_df
<- scores_plot_df %>%
scores_plot_df mutate(plot_id = c(1, 2, 3, 4, 5))
ggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +
geom_text(aes(label = plot_id)) +
labs(title = "NMDS Plot with Bray-Curtis Distance", x = "NMDS_1", y = "NMDS_2")
What hypotheses can we make based on this plot?
- We can make various hypotheses, such as that plots 1 and 3 are more similar to one another than to the other plots regrading overall species abundances
Now let’s quantify these results by finding the actual distances between these points on our ordination plot:
<- scores(nmds_Bray, display = "sites")
scores
vegdist(scores, method="euclidean")
1 2 3 4
2 0.3987570
3 0.1638629 0.2371359
4 0.3998220 0.4004870 0.3281351
5 0.4003879 0.7003845 0.5205948 0.8000000
What do these results imply?
- Plots 1 and 3 are the most similar
- Plots 4 and 5 are the most different from one another
- When looking at plot 1 where the distance to the other plots is hard to discern visually, it is closest to 2, then 4, then 5, though the distance is very similar between them and may not significantly different between these pairs
Again, NMDS is an example of an ordination technique, other examples include:
- Principal Component Analysis (PCA) is primarily used for uncovering patterns in data by reducing its dimensions linearly, without considering external variables.
- Canonical Correspondence Analysis (CCA) to explore the relationships between two sets of variables, particularly when one set is hypothesized to influence the other.
- Principal Coordinates Analysis (PCoA) focuses on preserving the pairwise distances between observations when data is reduced to a lower-dimensional space.
These ordination techniques are used to take into account many aspects, including environmental variables, to make comparisons using multiple dimensions.
Diversity Indices
Let’s bring our understanding together to quantify diversity using indices. Diversity is challenging to quantify because it is difficult to agree on a definition. For example, we could attempt to quantify biodiversity with any of the following:
- Probability of encountering a species at random
- Uncertainty of possible community states
- Variance in species composition between sites
Species Richness Index
The most common biodiversity index is the Species Richness Index, counting the number of unique species in a community. We already completed this task above, calling it \(\gamma\) diversity for our system:
<- df %>%
species_richness select(species_id) %>%
n_distinct()
species_richness
[1] 5
However, it is highly sensitive to variations in sampling effort because it treats all species equally, weighing rare and common species the same despite the greater likelihood of rare species going undetected.
It does not account for evenness: the fact that some species in the community are common and others are rare.
What do you notice about the examples above?
- They have equal species richness, however the communities are very different based on evenness.
Shannon and Simpson Diversity Indices
The Shannon(Shannon 1948) and Simpson Diversity Indices(Simpson 1949), both special cases of the Hill’s Number(Hill 1973), are very popular biodiversity indices because they focus on both richness and evenness.
\[ \begin{array}{l} Shannon-Wiener\ Diversity\ =-\sum _{i\ =\ i}^{S} p_{i}\ln p_{i}\\ \\ Simpson\ Diversity\ =\frac{1}{\sum _{i\ =\ 1}^{S} \ p_{i}^{2}}\\ \\ p_{i} \ =\ relative\ abundance\ of\ species\ i\\ S\ =\ total\ number\ of\ species \end{array} \]
You will likely see the Shannon Index referred to in different ways throughout the literature because of articles/books that were published on the topic by three different mathematicians over the course of a couple of years. Here is a quick guide to the alternative names you will encounter:
As richness and evenness increase, these indices increase. It is important to remember that Simpson’s Diversity Index emphasizes common species, while the Shannon Diversity Index is biased towards rare species. Here is how you can interpret these results:
Shannon’s Index interpretation:
- Based in information theory and represents the uncertainty that we will be able to predict which species will be randomly drawn from a community.
- Values of \(H\) (Shannon Entropy) are usually between 1.5-3.5 (the units are bits of information)
- The maximum value of \(H\) represents a perfectly even community
Simpson’s and Gini-Simpson’s Indeces interpretation:
- Represents the probability that two randomly selected individuals are from the same species.
- Simpson’s Index decreases with increasing richness, because of this Gini-Simpson index (\(1 - D\)) is often used to make value more comparable with other indices.
- The values of \(D\) range from 0 to 1
Because Simpson’s/Gini-Simpson’s indices refer more directly to richness than evenness, we can also calculate Simpson’s Evenness (“equitability”) where \(S\) is the number of species:
\[ \frac{1}{D S} \]
Now let’s calculate these values for our dataset. Thanks once again to the vegan
package, these equations are readily available through the function diversity()
. Here is the ?vegan::diversity()
help page:
Here is another nice summary as well explaining what exactly the diversity()
function is actually calculating:
Wrangle our dataframe comm_data_matrix
as needed, then compare the richness, Gini-Simpson Index, Simpson’s Evenness, and Shannon Index values for plots 1 and 2.
# Wrangle data
<- comm_data_matrix %>%
cdm_1 slice(1)
<- comm_data_matrix %>%
cdm_2 slice(2)
cdm_1
# A tibble: 1 × 5
triangle_sp square_sp cross_sp circle_sp x_sp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 2 2
cdm_2
# A tibble: 1 × 5
triangle_sp square_sp cross_sp circle_sp x_sp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 6 1 1 1 1
# Calculate richness
<- specnumber(cdm_1)
species_richness_1
<- specnumber(cdm_2)
species_richness_2
species_richness_1
[1] 5
species_richness_2
[1] 5
# Calculate Gini-Simpson's diversity
diversity(cdm_1, index = "simpson")
[1] 0.8
diversity(cdm_2, index = "simpson")
[1] 0.6
# Calculate Simpson's Evenness
<- diversity(cdm_1, index = "simpson")
Gs_1 <- diversity(cdm_2, index = "simpson")
Gs_2
<- 1 - Gs_1
D_1 <- 1 - Gs_2
D_2
1 / (D_1 * ncol(cdm_1))
[1] 1
1 / (D_2 * ncol(cdm_2))
[1] 0.5
# Calculate Shannon's diversity:
diversity(cdm_1, index = "shannon")
[1] 1.609438
diversity(cdm_2, index = "shannon")
[1] 1.227529
How would you interpret these results?
- Plots 1 and 2 have the same richness and very similar Gini-Simpson Index values which is related to measuring richness. However, for Shannon Entropy and even more so for Simpson’s Evenness the values the values are very distinct. Plots 1 is very even and plot 2 is very un-even. Without checking for evenness we might have missed how distinct these two communities are with plot 1 being very even and plot 2 being dominated by a single species (triangle sp.)
Now we can go through our analyses throughout this lab to answer the following questions:
Which site is most rich?
- They are all equally rich (5 species)
Which site has the least turnover?
- Plot 1
Which site is most even?
- Plot 1
Which sites are most different from one another following richness-based methods?
- Plots 4 and 5
Which sites are most similar to one another following richness-based methods?
- Plots 1 and 3
Is there a dominant species? If so, which is it?
- Triangle species (especially in plot 2)
Which sites are most different from one another following abundance-based methods?
- Plots 4 and 5
Which sites are most similar to one another following abundance-based methods?
- Plots 1 and 3