There are many ways to measure biodiversity and methods to compare the biodiversity between populations or communities. The simplest two are often considered to be abundance and richness.
Abundance
Richness is the number of species in a community while abundance is the number of individuals from each species in that community.
Let’s load our ratdat dataset
library(ratdat)head(ratdat::complete)
# A tibble: 6 × 13
record_id month day year plot_id species_id sex hindfoot_length weight
<int> <int> <int> <int> <int> <chr> <chr> <int> <int>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
How can we find abundance within this community?
We need to count the number of individuals detected from each species.
Now let’s wrangle our ratdat::complete dataset using dplyr to see the abundances of each species in our community.
df <- ratdat::complete %>%# Filter out rows with blank cells for species_idfilter(species_id !="") %>%# Subset dataframe by species_idgroup_by(species_id) %>%# Count number of rows in each subset, create a new column `abundance`summarise(abundance =n()) %>%# Put subsets back togetherungroup() %>%# Sort by abundancearrange(abundance)df
# A tibble: 48 × 2
species_id abundance
<chr> <int>
1 CS 1
2 CT 1
3 CU 1
4 CV 1
5 SC 1
6 ST 1
7 AS 2
8 RX 2
9 ZL 2
10 UL 4
# ℹ 38 more rows
Now let’s look at species by plot. Group abundances by both species and plot id. Now our dataframe will be in wide format which is preferred by the packages we will use throughout this lab.
df <- ratdat::complete %>%# Filter out rows with blank cells for species_idfilter(species_id !="") %>%# Subset dataframe by species_idgroup_by(species_id, plot_id) %>%# Count number of rows in each subset, create a new column `abundance`summarise(abundance =n()) %>%# Put subsets back togetherungroup() %>%# Convert to wide table, we will cover this again laterpivot_wider(names_from = species_id, values_from = abundance, values_fill =0)
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
head(df)
# A tibble: 6 × 49
plot_id AB AH AS BA CB CM CQ CS CT CU CV
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 7 7 0 1 0 0 0 0 0 0 0
2 2 14 7 0 1 0 1 1 0 0 0 0
3 3 10 2 0 19 1 3 1 0 0 0 0
4 4 3 2 0 0 1 0 0 0 0 0 0
5 5 2 3 0 4 1 0 2 0 0 0 0
6 6 17 6 0 0 1 0 0 0 1 0 0
# ℹ 37 more variables: DM <int>, DO <int>, DS <int>, DX <int>, NL <int>,
# OL <int>, OT <int>, OX <int>, PB <int>, PC <int>, PE <int>, PF <int>,
# PG <int>, PH <int>, PI <int>, PL <int>, PM <int>, PP <int>, PU <int>,
# PX <int>, RF <int>, RM <int>, RO <int>, RX <int>, SA <int>, SC <int>,
# SF <int>, SH <int>, SO <int>, SS <int>, ST <int>, SU <int>, UL <int>,
# UP <int>, UR <int>, US <int>, ZL <int>
Let’s plot for species AB and AH for the first 4 plots
df_AB_AH <- df %>% dplyr::select(plot_id, AB, AH) %>%filter(plot_id <=4)ggplot(data = df_AB_AH, aes(x = AB, y = AH)) +geom_text(aes(label = plot_id)) +labs(x ="AB", y ="AH", title ="Plot IDs positioned by abundances")
Distances
Euclidean Distance
Can you tell by looking at this plot how different the relationship is between these species by site?
We can calculate the distance between plots to quantify how different they are. In general, distances in ecology are primarily used to quantify dissimilarity. The first method is calculating the Euclidean distance between points. We just look at this x, y coordinate plot as a map. Therefore, this distance can be calculated with
Calculate the Euclidean distance between plots 1 and 3
x <- df_AB_AH$AB[1] - df_AB_AH$AH[1]y <- df_AB_AH$AB[3] - df_AB_AH$AH[3]sqrt(x^2+ y^2)
[1] 8
Bray-Curtis/Sørenson Distance
However, this Euclidean distance can quickly focus heavily on more abundant species over rare species. To prevent this by weighing abundant and rare species evenly, we can calculate the abundances between two sites divided by the total abundances at each site. This is called the Bray-Curtis or Sørenson Distance.
We have focused on pairs of 2 species, but our dataframe has more than 2 columns. Let’s plot 3 columns (species AB, AH, and BA) for 4 plots.
df_AB_AH_BA <- df %>%select(plot_id, AB, AH, BA) %>%filter(plot_id <=4)# Convert to plotly 3D scatter plotplot_ly(data = df_AB_AH_BA, x =~AB, y =~AH, z =~BA, type ='scatter3d', mode ='markers') %>%layout(scene =list(xaxis =list(title ='AB'),yaxis =list(title ='AH'),zaxis =list(title ='BA'),camera =list(eye =list(x =0, y =0, z =2.5),up =list(x =0, y =1, z =0) )))
How can we visualize 4 columns (species AB, AH, BA, and SS) to see their distance?
One technique that attempts to visualize an optimal arrangement is Non-metric Multidimensional Scaling (NMDS). Note that the axes of the plotted results are arbitrary (non-metric) but instead represent the relative weight of relationships.
Once again, vegan can help with this, among other options it can produce NMDS plots by either Euclidean or Bray-Curtis distance.
Wisconsin double standardization
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.133155 max resid 0.2017735
Run 2 stress 8.095724e-05
... Procrustes: rmse 0.2224001 max resid 0.3350531
Run 3 stress 0
... Procrustes: rmse 0.1221689 max resid 0.201466
Run 4 stress 0
... Procrustes: rmse 0.1656368 max resid 0.2820223
Run 5 stress 0
... Procrustes: rmse 0.1549081 max resid 0.2523481
Run 6 stress 0
... Procrustes: rmse 0.1745811 max resid 0.2837019
Run 7 stress 0
... Procrustes: rmse 0.141142 max resid 0.2240949
Run 8 stress 0
... Procrustes: rmse 0.1253157 max resid 0.188246
Run 9 stress 0
... Procrustes: rmse 0.1275168 max resid 0.1783719
Run 10 stress 0
... Procrustes: rmse 0.05976498 max resid 0.08892813
Run 11 stress 7.12166e-05
... Procrustes: rmse 0.229479 max resid 0.340543
Run 12 stress 0
... Procrustes: rmse 0.1754167 max resid 0.2148794
Run 13 stress 0
... Procrustes: rmse 0.1259958 max resid 0.1742557
Run 14 stress 0
... Procrustes: rmse 0.1935217 max resid 0.316349
Run 15 stress 0
... Procrustes: rmse 0.08341071 max resid 0.1386356
Run 16 stress 0
... Procrustes: rmse 0.2032435 max resid 0.310633
Run 17 stress 0
... Procrustes: rmse 0.1188598 max resid 0.1824606
Run 18 stress 0
... Procrustes: rmse 0.184431 max resid 0.3011979
Run 19 stress 0
... Procrustes: rmse 0.04545514 max resid 0.06604594
Run 20 stress 7.539867e-05
... Procrustes: rmse 0.2283102 max resid 0.3411722
*** Best solution was not repeated -- monoMDS stopping criteria:
20: stress < smin
Warning in metaMDS(df_AB_AH_BA_SS %>% select(-plot_id), distance =
"euclidean"): stress is (nearly) zero: you may have insufficient data
scores_plot_df <-as.data.frame(scores(nmds_Euc, display ="sites"))scores_plot_df$plot_id <- df_AB_AH_BA_SS$plot_idggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +geom_text(aes(label = plot_id)) +labs(title ="NMDS Plot with Euclidean Distance", x ="NMDS_1", y ="NMDS_2")
# Repeat for Bray-Curtis distancenmds_Bray <-metaMDS(df_AB_AH_BA_SS %>%select(-plot_id), distance ="bray")
Wisconsin double standardization
Run 0 stress 0
Run 1 stress 0
... Procrustes: rmse 0.1231563 max resid 0.2073613
Run 2 stress 0
... Procrustes: rmse 0.1339593 max resid 0.1593201
Run 3 stress 0
... Procrustes: rmse 0.1699433 max resid 0.1978182
Run 4 stress 0
... Procrustes: rmse 0.06628414 max resid 0.09287973
Run 5 stress 0
... Procrustes: rmse 0.1668645 max resid 0.1899495
Run 6 stress 0
... Procrustes: rmse 0.08747335 max resid 0.128069
Run 7 stress 0
... Procrustes: rmse 0.04918992 max resid 0.06570157
Run 8 stress 0
... Procrustes: rmse 0.1731014 max resid 0.2026734
Run 9 stress 0
... Procrustes: rmse 0.1573814 max resid 0.164574
Run 10 stress 0
... Procrustes: rmse 0.07973617 max resid 0.1322914
Run 11 stress 0
... Procrustes: rmse 0.06287776 max resid 0.0839981
Run 12 stress 0
... Procrustes: rmse 0.1805427 max resid 0.2133046
Run 13 stress 0
... Procrustes: rmse 0.08993397 max resid 0.1190303
Run 14 stress 0
... Procrustes: rmse 0.1041228 max resid 0.1711877
Run 15 stress 2.930413e-05
... Procrustes: rmse 0.08119986 max resid 0.1253618
Run 16 stress 0
... Procrustes: rmse 0.03147627 max resid 0.05260101
Run 17 stress 0
... Procrustes: rmse 0.07743176 max resid 0.1303473
Run 18 stress 0
... Procrustes: rmse 0.08985485 max resid 0.1146704
Run 19 stress 0
... Procrustes: rmse 0.120438 max resid 0.1997085
Run 20 stress 0
... Procrustes: rmse 0.05740063 max resid 0.07307026
*** Best solution was not repeated -- monoMDS stopping criteria:
20: stress < smin
Warning in metaMDS(df_AB_AH_BA_SS %>% select(-plot_id), distance = "bray"):
stress is (nearly) zero: you may have insufficient data
Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
half-change scaling: too few points below threshold
scores_plot_df <-as.data.frame(scores(nmds_Bray, display ="sites"))scores_plot_df$plot_id <- df_AB_AH_BA_SS$plot_idggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +geom_text(aes(label = plot_id)) +labs(title ="NMDS Plot with Bray-Curtis Distance", x ="NMDS_1", y ="NMDS_2")
Compare this to our original plot, are there any differences that you notice?
df_AB_AH <- df %>% dplyr::select(plot_id, AB, AH) %>%filter(plot_id <=4)ggplot(data = df_AB_AH, aes(x = AB, y = AH)) +geom_text(aes(label = plot_id)) +labs(x ="AB", y ="AH", title ="Plot IDs positioned by abundances")
NMDS plots can go well beyond 4-dimensions, because of this expandability, we can plot distances for all species in our sample simultaneously to identify which plots may be more different from one another as far as species abundances is concerned.
# Repeat creating NMDS using Bray-Curtis distance for all speciesnmds_Bray <-metaMDS(df %>%select(-plot_id), distance ="bray")
Square root transformation
Wisconsin double standardization
Run 0 stress 0.1732437
Run 1 stress 0.1692261
... New best solution
... Procrustes: rmse 0.07820723 max resid 0.2409317
Run 2 stress 0.3762069
Run 3 stress 0.197087
Run 4 stress 0.1786437
Run 5 stress 0.1760508
Run 6 stress 0.1786437
Run 7 stress 0.1786437
Run 8 stress 0.1687115
... New best solution
... Procrustes: rmse 0.05406268 max resid 0.185574
Run 9 stress 0.1796872
Run 10 stress 0.1830521
Run 11 stress 0.3840954
Run 12 stress 0.1666806
... New best solution
... Procrustes: rmse 0.04441907 max resid 0.1409255
Run 13 stress 0.173811
Run 14 stress 0.1763763
Run 15 stress 0.1819952
Run 16 stress 0.1758261
Run 17 stress 0.188976
Run 18 stress 0.1742797
Run 19 stress 0.1692261
Run 20 stress 0.1823301
*** Best solution was not repeated -- monoMDS stopping criteria:
18: stress ratio > sratmax
2: scale factor of the gradient < sfgrmin
scores_plot_df <-as.data.frame(scores(nmds_Bray, display ="sites"))scores_plot_df$plot_id <- df$plot_idggplot(scores_plot_df, aes(x = NMDS1, y = NMDS2)) +geom_text(aes(label = plot_id)) +labs(title ="NMDS Plot with Bray-Curtis Distance", x ="NMDS_1", y ="NMDS_2")
What hypotheses can we make based on this plot?
We can make various hypotheses, such as that plots 10 and 23 are more similar to one another than to the other plots regrading species abundances
NMDS is an example of an ordination technique, which allow us to arrange multidimensional data and reduce it to low-dimensional space based on their similarities or dissimilarities. This is very useful for visualizing and analyzing complex data. Other examples include:
Principal Component Analysis (PCA) is primarily used for uncovering patterns in data by reducing its dimensions linearly, without considering external variables.
Canonical Correspondence Analysis (CCA) to explore the relationships between two sets of variables, particularly when one set is hypothesized to influence the other.
Principal Coordinates Analysis (PCoA) focuses on preserving the pairwise distances between observations when data is reduced to a lower-dimensional space.
All of these ordination methods do something more or less like this following visualization of a simple PCA, they look at multivariate point cloud from a certain angle and take a 2D slice through that plane:
Now that we have explored abundance, let’s return to richness as a biodiversity metric. Do you remember the three ways we have talked about measuring diversity this semester?
Gamma Diversity
What is the species richness of our region for ratdat::complete?
How similar are these plots? How distinct are they? \(\alpha\) diversity can start to answer these questions, but doesn’t provide too clear of an answer.
Calculating \(\beta\) diversity can begin to help us explore their differences.
Beta Diversity
\(\beta\)diversity can be defined as the number of species in a region not observed in sample or the turnover in species composition among sites. Therefore both \(\gamma\) (regional) and \(\alpha\) (site) diversity are directly related to \(\beta\) diversity.
\(\beta\) diversity is often calculated via additive partitioning.
\[
\beta \ =\ \gamma \ -\ \overline{\alpha }
\]
Where \(\overline{\alpha }\) is the average \(\alpha\) diversity of the sample. Here, \(\beta\) diversity is the average number of species not found in a sample even though they are found in a region.
It is also commonly calculated using multiplicative partitioning:
This metric for turnover aids in determining how many unique assemblages exist in a region.
As with any step of model selection, you must be intentional about which version of \(\beta\) diversity is applicable to your question and sample.
We will calculate the beta diversity of sites 7 and 12. Once again, we will be using the vegan package to help with this.
First, we will need to reformat the data into the format that vegan prefers called a “Community data matrix,” with species names across the X-axis of the dataframe, sampling event down the Y-axis, and abundances where they meet. We will widen the dataframe by species name. To do so, we will use the tidyRpivot_wider() function.
comm_data_matrix <- ratdat::complete %>%# Filter out blank species_idsfilter(species_id !="") %>%# Add a count of one for each samplemutate(count =1) %>%# Widen based on speciespivot_wider(names_from = species_id, values_from = count, values_fill =0) %>%# Create a year_month_day columnmutate(year_month_day =sprintf('%04d_%02d_%02d', year, month, day)) %>%# Select only for year_month_day, plot_id, and species columnsselect(-year, -month, -day, -record_id, -sex, -hindfoot_length, -weight, -genus, -species, -taxa, -plot_type) %>%# Combine unique year_month_days/plot_id combination and sum species countsunite(year_month_day, year_month_day) %>%group_by(year_month_day, plot_id) %>%summarise_all(sum) %>%ungroup()head(comm_data_matrix)
# A tibble: 6 × 50
year_month_day plot_id NL DM PF PE DS PP SH OT DO
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1977_07_16 1 0 2 1 0 0 0 0 0 0
2 1977_07_16 2 1 1 0 1 0 1 0 0 0
3 1977_07_16 3 1 2 0 0 1 0 0 0 0
4 1977_07_16 4 0 1 1 0 0 0 0 0 0
5 1977_07_16 5 0 0 0 0 1 0 0 0 0
6 1977_07_16 6 0 1 1 0 0 0 0 0 0
# ℹ 39 more variables: OX <dbl>, SS <dbl>, OL <dbl>, RM <dbl>, SA <dbl>,
# PM <dbl>, AH <dbl>, DX <dbl>, AB <dbl>, CB <dbl>, CM <dbl>, CQ <dbl>,
# RF <dbl>, PC <dbl>, PG <dbl>, PH <dbl>, PU <dbl>, CV <dbl>, UR <dbl>,
# UP <dbl>, ZL <dbl>, UL <dbl>, CS <dbl>, SC <dbl>, BA <dbl>, SF <dbl>,
# RO <dbl>, AS <dbl>, SO <dbl>, PI <dbl>, ST <dbl>, CU <dbl>, SU <dbl>,
# RX <dbl>, PB <dbl>, PL <dbl>, PX <dbl>, CT <dbl>, US <dbl>
To calculate \(\overline{\alpha }\), use the vegan function specnumber():
alpha_bar <-mean(specnumber(df))
This will need to be done after the dataframe has only species columns, because that is the format vegan prefers.
With this information, calculate beta diversity for sites 7 and 12.
Additive partitioned \(\beta\) diversity is similar between the sites, therefore similar average number of species not found in these sites similar
However, multiplicative partitioned \(\beta\) diversity is different between the sites, so the turnover compared to the region different
Because \(\beta \ >\ \overline{\alpha }\), each plot misses more species than it finds.
Diversity Indices
Let’s bring our understanding together to quantify diversity using indices. Diversity is challenging to quantify because it is difficult to agree on a definition. For example, we could attempt to quantify biodiversity with any of the following:
Probability of encountering a species at random
Uncertainty of possible community states
Variance in species composition between sites
Species Richness Index
The most common biodiversity index is the Species Richness Index, counting the number of unique species in a community. We already completed this task above, calling it \(\gamma\) diversity for our system:
However, it is highly sensitive to variations in sampling effort because it treats all species equally, weighing rare and common species equally despite the greater likelihood of rare species going undetected.
It does not account for evenness: the fact that some species in the community are common and others are rare.
They have equal species richness, however the communities are very different based on evenness.
Shannon and Simpson Diversity Indices
The Shannon(Shannon 1948) and Simpson Diversity Indices(Simpson 1949), both special cases of the Hill’s Number(Hill 1973), are very popular biodiversity indices because they focus on both richness and evenness.
You will likely see the Shannon Index referred to in different ways throughout the literature because of articles/books that were published on the topic by three different mathematicians over the course of a couple of years. Here is a quick guide to the alternative names you will encounter:
As richness and evenness increase, these indices increase. It is important to remember that Simpson’s Diversity Index emphasizes common species, while the Shannon Diversity Index is biased towards rare species. Here is how you can interpret these results:
Shannon’s Index interpretation:
Shannon’s Index is based in information theory and represents the uncertainty that we will be able to predict which species will be randomly drawn from a community.
Values of \(H\) (Shannon Entropy) are usually between 1.5-3.5 (the units are bits of information)
The maximum value of \(H\) represents a perfectly even community
Simpson’s Index interpretation:
Simpson’s Index represents the probability that two randomly selected individuals are from the same species.
Simpson’s Index decreases with increasing richness, because of this Gini-Simpson index (\(1 - D\)) is often used to make value more comparable with other indices.
The values of \(D\) range from 0 to 1
Because Simpson’s/Gini-Simpson’s indices refer more directly to richness than evenness, we can also calculate Simpson’s Evenness (“equitability”) where \(S\) is the number of species:
\[
\frac{1}{DS}
\]
Now let’s calculate these values for our dataset. Thanks once again to the vegan package, these equations are readily available through the function diversity(). Here is the ?vegan::diversity() help page:
Here is another nice summary as well explaining what exactly the diversity() function is actually calculating:
Let’s assume plots 1-8 and 17-24 represent two distinct regions. Wrangle our dataframe comm_data_matrix to combine plots (hint, use the summarize(sum) function) and remove the date column. Then compare the Gini-Simpson Index, Simpson’s Evenness, and Shannon Index values for plots 1-8 versus plots 17-24.
# Calculate Simpson's EvennessGs_1_8 <-diversity(df_1_8, index ="simpson")Gs_17_24 <-diversity(df_17_24, index ="simpson")D_1_8 <-1- Gs_1_8D_17_24 <-1- Gs_17_241/ (D_1_8 *ncol(df_1_8))
[1] 0.1591402
1/ (D_17_24 *ncol(df_17_24))
[1] 0.1889005
# Calculate Shannon's diversity:diversity(df_1_8, index ="shannon")
[1] 2.377345
diversity(df_17_24, index ="shannon")
[1] 2.513817
How would you interpret these results?
Plots 1-8 and plots 17-24 have a relatively similar value for the Gini-Simpson’s Index, with plots 17-24 having slightly higher richness overall
Plots 1-8 and plots 17-24 have a relatively similar value for the Simpson’s Evenness, with plots 17-24 being slightly more even
In agreement with the Simpson’s Evenness value, when emphasizing rare species using the Shannon’s Diversity Index, plots 1-8 and plots 17-24 have a relatively similar value for the Shannons’s Diversity Index (Shannon’s Entropy), with plots 17-24 being slightly more even. Overall, this is a relatively high value of \(H\) indicating relatively high evenness
In summary, these sets of plots seem to be similar when looking at evenness
References
Hill, Mark O. 1973. “Diversity and Evenness: A Unifying Notation and Its Consequences.”Ecology 54 (2): 427–32.
Mittelbach, Gary G, and Brian J McGill. 2019. Community Ecology. Oxford University Press.
Oksanen, Jari, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O’Hara, et al. 2022. “Vegan: Community Ecology Package.”https://CRAN.R-project.org/package=vegan.
Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.”The Bell System Technical Journal 27 (3): 379–423.
Simpson, Edward H. 1949. “Measurement of Diversity.”Nature 163 (4148): 688–88.
Stevens, Martin Henry Hoffman. 2009. A Primer of Ecology with r. Springer.