rm(list = ls())Charles’ Data Cleaning Tips
Principles
- Remember the 2 steps of environment set-up:
Clear your environment
- This ensures that when you re-run the whole script, data and dataframes from other scripts will not interfere with your current script.
- Load your packages
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)Don’t modify your file (.csv, .rds, .RData, .xlsx, etc.) directly.
- It should be kept in its raw form. This is so that your workflow in R is reproducible by someone else if they download the same dataset. Instead, modify the contents through your data cleaning in R.
When importing your data, import the raw data as its own data frame that you do not modify further.
- When cleaning, assign that to a new variable. That way when you re-run all of your code, if you modified the data frame throughout, it will always start your analysis with the raw data like it should.
df_raw <- data.frame()
df_clean <- df_raw %>%
filter() %>%
select()When naming your own columns, avoid spaces. Use underscores ( _ ) instead
When naming your own files or folders, avoid spaces. Use underscores ( _ ) instead
Remember to be an explicit programmer when naming objects!
- Throughout this handout, I use object generic names like
df_rawanddf_clean. For your actual datasets you should give them meaningful names instead ofdfthat help describe the data!
- Throughout this handout, I use object generic names like
Remember the 3 steps of preparing data:
Import data
Inspect data
Clean data
Importing data
Can’t find your file?
df_raw <- read.csv(file = "data.csv")If you receive an error like:
Error in file(file, "rt") : cannot open the connectionThis means the the function cannot locate your file
Step 1: Ensure the File is in the Correct Location
- If you use the code below, This tells R to look for the
"data.csv"file in the same folder as your script. Make sure the file is in the correct folder:
- If you use the code below, This tells R to look for the
df_raw <- read.csv(file = "data.csv")Note, if you want to explicitly indicate the current folder where the .qmd file is located, you can use ./:
df_raw <- read.csv(file = "./data.csv")This is functionally equivalent to omitting ./, as R assumes the current folder by default.
Step 2: Specify a Different Folder
First, check the location of your working directory, which is the folder your script is currently looking for files in:
getwd()[1] "C:/Users/Charles/Documents/GitHub/BISC_404_Ecology_and_Biodiversity/Labs/Lab_3"
Note: if this doesn’t match your expectations, you can use setwd() to set the correct folder
If your "data.csv" file is not in the same folder as your script file, you can direct R to look in another folder. For example:
- If your script file is in a folder called
Website_files, and the"data.csv"file is in a subfolder calledcleaning_raw_data, you can specify the path like this:
data_raw <- read.csv(file = "cleaning_raw_data/data.csv")- If your
"data.csv"file is in the parent folder (the folder that containsLab_3), use../to indicate “go up one folder”:
data_raw <- read.csv(file = "../cleaning_raw_data/data.csv")Key Takeaways
- Use
./to indicate the current folder (though it is usually implied by default). - Use
../to move up a folder in the directory hierarchy. - Use
<subfolder_name>/to specify a subfolder within the current folder.
Double-check your file’s location and path to ensure successful loading
Importing R Data Files
.rds and .rda (or .Rdata) files are used to store data natively in R.
The main advantage of .rda files is that they can contain a variety of data in a single file (objects, plots, functions, packages, etc.), however when loading a .rda file you run the risk of over-writing existing objects, plots, etc. with the same name.
.rds files can only contain a single data structure. A benefit of this file type is that loading a .rds file will not overwrite data structures with the same name.
df_raw <- readRDS(file = "cleaning_raw_data/data.rds")
df_raw# A tibble: 7 × 3
`eggs (per ml)` `dead animals (per ml)` `external medium (mu mol N / l)`
<dbl> <dbl> <dbl>
1 NA NA 80
2 0 0.4 80
3 7.02 0 80
4 3.41 0.4 80
5 6.82 0.4 80
6 3.41 0.2 80
7 14.4 1 80
Importing a CSV
data_raw <- read.csv(file = "cleaning_raw_data/data.csv")
data_raw eggs..per.ml. dead.animals..per.ml. external.medium..mu.mol.N...l.
1 NA NA 80
2 0.00 0.4 80
3 7.02 0.0 80
4 3.41 0.4 80
5 6.82 0.4 80
6 3.41 0.2 80
7 14.45 1.0 80
Odd column name formatting from importing your .csv?
library(readr)
df_raw <- read_csv(file = "cleaning_raw_data/data.csv") Rows: 7 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): eggs (per ml), dead animals (per ml), external medium (mu mol N / l)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_raw# A tibble: 7 × 3
`eggs (per ml)` `dead animals (per ml)` `external medium (mu mol N / l)`
<dbl> <dbl> <dbl>
1 NA NA 80
2 0 0.4 80
3 7.02 0 80
4 3.41 0.4 80
5 6.82 0.4 80
6 3.41 0.2 80
7 14.4 1 80
Importing CSV Troubleshooting
df_raw <- read_csv(file = "cleaning_raw_data/data.csv.txt")Rows: 9 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): eggs (per ml);dead animals (per ml);external medium (mu mol N / l)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_raw# A tibble: 9 × 1
`eggs (per ml);dead animals (per ml);external medium (mu mol N / l)`
<chr>
1 NA;NA;80
2 0;0.4;80
3 7.02;0;80
4 3.41;0.4;80
5 6.82;0.4;80
6 3.41;0.2;80
7 14.45;1;80
8 22;2;NA
9 0;-16;80
Why is this .csv file not loading correctly?
Check the delimiters. Remember that .csv stands for “comma-separated file,” but this file looks like it has semi-colons (;) as separators. We can use read_delim() for ultimate flexibility.
library(readr)
df_raw <- read_delim(file = "cleaning_raw_data/data.csv.txt", delim = ";")Rows: 9 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (3): eggs (per ml), dead animals (per ml), external medium (mu mol N / l)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_raw# A tibble: 9 × 3
`eggs (per ml)` `dead animals (per ml)` `external medium (mu mol N / l)`
<dbl> <dbl> <dbl>
1 NA NA 80
2 0 0.4 80
3 7.02 0 80
4 3.41 0.4 80
5 6.82 0.4 80
6 3.41 0.2 80
7 14.4 1 80
8 22 2 NA
9 0 -16 80
Importing an Excel file
library(readxl)
df_raw <- read_excel("cleaning_raw_data/data.xlsx")
df_raw# A tibble: 0 × 1
# ℹ 1 variable: blank <lgl>
Data missing? Check the sheet
# Either by number
df_raw <- read_excel("cleaning_raw_data/data.xlsx", sheet = 2)New names:
• `` -> `...1`
• `` -> `...3`
• `` -> `...4`
df_raw# A tibble: 10 × 4
...1 `Table 1` ...3 ...4
<chr> <chr> <chr> <chr>
1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol…
2 1 NA NA 80
3 2 0 0.4 80
4 3 7.02 0 80
5 4 3.41 0.4 80
6 5 6.82 0.4 80
7 6 3.41 0.2 80
8 7 14.45 1 80
9 8 3 0.2 NA
10 9 4.3 -16 80
# Or by name
df_raw <- read_excel("cleaning_raw_data/data.xlsx", sheet = "Sheet2")New names:
• `` -> `...1`
• `` -> `...3`
• `` -> `...4`
df_raw# A tibble: 10 × 4
...1 `Table 1` ...3 ...4
<chr> <chr> <chr> <chr>
1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol…
2 1 NA NA 80
3 2 0 0.4 80
4 3 7.02 0 80
5 4 3.41 0.4 80
6 5 6.82 0.4 80
7 6 3.41 0.2 80
8 7 14.45 1 80
9 8 3 0.2 NA
10 9 4.3 -16 80
Exploring Data
There are many ways to explore your raw data! Here are some examples:
head(df_raw)# A tibble: 6 × 4
...1 `Table 1` ...3 ...4
<chr> <chr> <chr> <chr>
1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol …
2 1 NA NA 80
3 2 0 0.4 80
4 3 7.02 0 80
5 4 3.41 0.4 80
6 5 6.82 0.4 80
summary(df_raw) ...1 Table 1 ...3 ...4
Length:10 Length:10 Length:10 Length:10
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
names(df_raw)[1] "...1" "Table 1" "...3" "...4"
dim(df_raw)[1] 10 4
str(df_raw)tibble [10 × 4] (S3: tbl_df/tbl/data.frame)
$ ...1 : chr [1:10] "treatment number" "1" "2" "3" ...
$ Table 1: chr [1:10] "eggs (per ml)" "NA" "0" "7.02" ...
$ ...3 : chr [1:10] "dead animals (per ml)" "NA" "0.4" "0" ...
$ ...4 : chr [1:10] "external medium (mu mol N / l)" "80" "80" "80" ...
And here are some rough, exploratory plot examples. Note I am just using them to explore the raw data, so did not include labels like you should in your real analysis.
ggplot(data = df_raw, aes(x = as.numeric(`Table 1`), y = as.numeric(`...3`))) +
geom_point() +
labs(x = "Eggs (per ml)", y = "Dead animals (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
ggplot(data = df_raw, aes(x = as.numeric(`Table 1`), y = as.numeric(`...3`))) +
geom_point() +
geom_smooth() +
labs(x = "Eggs (per ml)", y = "Dead animals (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
ggplot(data = df_raw, aes(x = as.numeric(`Table 1`))) +
geom_histogram() +
labs(x = "Eggs (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
ggplot(data = df_raw, aes(x = as.numeric(`...3`))) +
geom_histogram() +
labs(x = "Dead animals (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
ggplot(data = df_raw, aes(x = seq_along(as.numeric(`Table 1`)), y = as.numeric(`Table 1`))) +
geom_line() +
labs(y = "Eggs (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
ggplot(data = df_raw, aes(x = seq_along(as.numeric(`...3`)), y = as.numeric(`...3`))) +
geom_line() +
labs(y = "Dead animals (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
ggplot(data = df_raw, aes(x = "", y = as.numeric(`Table 1`))) +
geom_boxplot() +
geom_jitter() +
labs(y = "Eggs (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
ggplot(data = df_raw, aes(x = "", y = as.numeric(`...3`))) +
geom_boxplot() +
geom_jitter() +
labs(y = "Dead animals (per ml)")Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
But your data exploration does not stop at running these functions! You now must critically look at the results and interpret what they mean. Then clearly describe what you are seeing to yourself.
Cleaning Data
Cleaning column names
Now how do you reference these column names? With backticks ( ` ) like we did above:
ggplot(data = df_raw, aes(y = `Table 1`)) +
geom_bar()Your titles are in the incorrect row because of spreadsheet formatting?
df_clean <- df_raw
df_clean# A tibble: 10 × 4
...1 `Table 1` ...3 ...4
<chr> <chr> <chr> <chr>
1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol…
2 1 NA NA 80
3 2 0 0.4 80
4 3 7.02 0 80
5 4 3.41 0.4 80
6 5 6.82 0.4 80
7 6 3.41 0.2 80
8 7 14.45 1 80
9 8 3 0.2 NA
10 9 4.3 -16 80
# Use first row as column names
names(df_clean) <- df_raw %>%
filter(row_number() == 1)Warning: The `value` argument of `names<-()` must be a character vector as of tibble
3.0.0.
df_clean# A tibble: 10 × 4
`treatment number` `eggs (per ml)` `dead animals (per ml)`
<chr> <chr> <chr>
1 treatment number eggs (per ml) dead animals (per ml)
2 1 NA NA
3 2 0 0.4
4 3 7.02 0
5 4 3.41 0.4
6 5 6.82 0.4
7 6 3.41 0.2
8 7 14.45 1
9 8 3 0.2
10 9 4.3 -16
# ℹ 1 more variable: `external medium (mu mol N / l)` <chr>
# Filter out first row
df_clean <- df_clean %>%
filter(row_number() != 1)
df_clean# A tibble: 9 × 4
`treatment number` `eggs (per ml)` `dead animals (per ml)`
<chr> <chr> <chr>
1 1 NA NA
2 2 0 0.4
3 3 7.02 0
4 4 3.41 0.4
5 5 6.82 0.4
6 6 3.41 0.2
7 7 14.45 1
8 8 3 0.2
9 9 4.3 -16
# ℹ 1 more variable: `external medium (mu mol N / l)` <chr>
How do you replace these spaces with underscores to make the columns easier to works with?
- Option 1: One at a time (manually):
df_clean1 <- df_clean %>%
rename(eggs_per_ml = `eggs (per ml)`)
df_clean1# A tibble: 9 × 4
`treatment number` eggs_per_ml `dead animals (per ml)` external medium (mu m…¹
<chr> <chr> <chr> <chr>
1 1 NA NA 80
2 2 0 0.4 80
3 3 7.02 0 80
4 4 3.41 0.4 80
5 5 6.82 0.4 80
6 6 3.41 0.2 80
7 7 14.45 1 80
8 8 3 0.2 NA
9 9 4.3 -16 80
# ℹ abbreviated name: ¹`external medium (mu mol N / l)`
We also need to get rid of the other troublesome punctuation, like the parentheses, because they will mess with our functions. Imagine if we tried to run the following. How would R try to interpret that?
df_clean <- df_raw %>%
rename(external_medium_(mu_mol_N_/_l) = `external medium (mu mol N / l)`)Error in parse(text = input): <text>:2:38: unexpected symbol
1: df_clean <- df_raw %>%
2: rename(external_medium_(mu_mol_N_/_l
^
- Option 2: Among other ways to accomplish removing all troublesome punctuation, here are some options:
df_clean2 <- df_clean %>%
rename_with(~ gsub("[^A-Za-z0-9]", "_", .x))
df_clean2# A tibble: 9 × 4
treatment_number eggs__per_ml_ dead_animals__per_ml_ external_medium__mu_mol…¹
<chr> <chr> <chr> <chr>
1 1 NA NA 80
2 2 0 0.4 80
3 3 7.02 0 80
4 4 3.41 0.4 80
5 5 6.82 0.4 80
6 6 3.41 0.2 80
7 7 14.45 1 80
8 8 3 0.2 NA
9 9 4.3 -16 80
# ℹ abbreviated name: ¹external_medium__mu_mol_N___l_
library(stringr)
df_clean3 <- df_clean %>%
rename_with(~ str_replace_all(.x, "[^A-Za-z0-9]", "_"))
df_clean3# A tibble: 9 × 4
treatment_number eggs__per_ml_ dead_animals__per_ml_ external_medium__mu_mol…¹
<chr> <chr> <chr> <chr>
1 1 NA NA 80
2 2 0 0.4 80
3 3 7.02 0 80
4 4 3.41 0.4 80
5 5 6.82 0.4 80
6 6 3.41 0.2 80
7 7 14.45 1 80
8 8 3 0.2 NA
9 9 4.3 -16 80
# ℹ abbreviated name: ¹external_medium__mu_mol_N___l_
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
df_clean <- df_clean %>%
clean_names()
df_clean# A tibble: 9 × 4
treatment_number eggs_per_ml dead_animals_per_ml external_medium_mu_mol_n_l
<chr> <chr> <chr> <chr>
1 1 NA NA 80
2 2 0 0.4 80
3 3 7.02 0 80
4 4 3.41 0.4 80
5 5 6.82 0.4 80
6 6 3.41 0.2 80
7 7 14.45 1 80
8 8 3 0.2 NA
9 9 4.3 -16 80
Assorted cleaning
As you explore your dataset, keep an eye out for potential red flags that may indicate data errors. Here are some key issues to check for:
NAs:
NA(not applicable) orNaN(not a number) valuesOutliers: Values that are far from the mean may indicate errors. Check using
summary(), scatter plots, histograms.Numbers that don’t make sense:
Can there be negative counts of animals? Probably not!
If your study covers 2012-2017, but there’s a data point from 1900, it’s likely a data error.
If your study is in North America, but you see a point at lat/lon (0,0), it could be a coordinate entry error.
Percentages over 100%
Etc.
Incorrect Data Types: Remember that incorrect datatypes can cause analysis errors, check the datatypes of every column to make sure they are what you want
Dates in Incorrect Format: Use the
lubridatepackage’s formulas to convert to date data object. For example, if input data in format of month/day/year then usemutate(<new date column name> = mdy(<existing date column name>))in your cleaning pipe
What red flags do you see from our current dataset?
summary(df_clean) treatment_number eggs_per_ml dead_animals_per_ml
Length:9 Length:9 Length:9
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
external_medium_mu_mol_n_l
Length:9
Class :character
Mode :character
Should these be character/string data types?
df_clean <- df_clean %>%
mutate(eggs_per_ml = as.numeric(eggs_per_ml)) %>%
mutate(dead_animals_per_ml = as.numeric(dead_animals_per_ml)) %>%
mutate(external_medium_mu_mol_n_l = as.numeric(external_medium_mu_mol_n_l))Warning: There was 1 warning in `mutate()`.
ℹ In argument: `eggs_per_ml = as.numeric(eggs_per_ml)`.
Caused by warning:
! NAs introduced by coercion
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `dead_animals_per_ml = as.numeric(dead_animals_per_ml)`.
Caused by warning:
! NAs introduced by coercion
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `external_medium_mu_mol_n_l =
as.numeric(external_medium_mu_mol_n_l)`.
Caused by warning:
! NAs introduced by coercion
summary(df_clean) treatment_number eggs_per_ml dead_animals_per_ml
Length:9 Min. : 0.000 Min. :-16.000
Class :character 1st Qu.: 3.308 1st Qu.: 0.150
Mode :character Median : 3.855 Median : 0.300
Mean : 5.301 Mean : -1.675
3rd Qu.: 6.870 3rd Qu.: 0.400
Max. :14.450 Max. : 1.000
NA's :1 NA's :1
external_medium_mu_mol_n_l
Min. :80
1st Qu.:80
Median :80
Mean :80
3rd Qu.:80
Max. :80
NA's :1
Does -16 make sense for a value for density of dead animals?
df_clean <- df_clean %>%
filter(dead_animals_per_ml > 0)
summary(df_clean) treatment_number eggs_per_ml dead_animals_per_ml
Length:6 Min. : 0.000 Min. :0.2000
Class :character 1st Qu.: 3.103 1st Qu.:0.2500
Mode :character Median : 3.410 Median :0.4000
Mean : 5.182 Mean :0.4333
3rd Qu.: 5.968 3rd Qu.:0.4000
Max. :14.450 Max. :1.0000
external_medium_mu_mol_n_l
Min. :80
1st Qu.:80
Median :80
Mean :80
3rd Qu.:80
Max. :80
NA's :1
Removing NAs
- First check for NAs
summary(df_clean) treatment_number eggs_per_ml dead_animals_per_ml
Length:6 Min. : 0.000 Min. :0.2000
Class :character 1st Qu.: 3.103 1st Qu.:0.2500
Mode :character Median : 3.410 Median :0.4000
Mean : 5.182 Mean :0.4333
3rd Qu.: 5.968 3rd Qu.:0.4000
Max. :14.450 Max. :1.0000
external_medium_mu_mol_n_l
Min. :80
1st Qu.:80
Median :80
Mean :80
3rd Qu.:80
Max. :80
NA's :1
- Then you need to decide:
- Option 1) Are you going to remove allrows that have NAs?
- Do this only if this will not eliminate data we are interested in
library(tidyr)
df_clean1 <- df_clean %>%
clean_names() %>%
drop_na()
df_clean1# A tibble: 5 × 4
treatment_number eggs_per_ml dead_animals_per_ml external_medium_mu_mol_n_l
<chr> <dbl> <dbl> <dbl>
1 2 0 0.4 80
2 4 3.41 0.4 80
3 5 6.82 0.4 80
4 6 3.41 0.2 80
5 7 14.4 1 80
Note: this was not the right choice in our case because we lost the egg and dead animal density data we are interested in from this column because of the NA in the media column! As long as we assume this NA was a data error and it had the same amount of media as the rest, we should preserve that data for our analysis.
Option 2) Are you going to remove certain rows that have NAs?
- Choose this if you want to specify each individual column you are interested in
# If we are only interested in analyzing the "eggs_per_ml" and "dead_animals_per_ml" columns
df_clean2 <- df_clean %>%
clean_names() %>%
filter(!is.na(eggs_per_ml)) %>%
filter(!is.na(dead_animals_per_ml))
df_clean2# A tibble: 6 × 4
treatment_number eggs_per_ml dead_animals_per_ml external_medium_mu_mol_n_l
<chr> <dbl> <dbl> <dbl>
1 2 0 0.4 80
2 4 3.41 0.4 80
3 5 6.82 0.4 80
4 6 3.41 0.2 80
5 7 14.4 1 80
6 8 3 0.2 NA
- Option 3) Are you going to remove columns that have NAs
- This is a safe option to choose certain columns are not important to your analysis
# Remove the column we are not interested in
df_clean <- df_clean %>%
select(-external_medium_mu_mol_n_l)
df_clean# A tibble: 6 × 3
treatment_number eggs_per_ml dead_animals_per_ml
<chr> <dbl> <dbl>
1 2 0 0.4
2 4 3.41 0.4
3 5 6.82 0.4
4 6 3.41 0.2
5 7 14.4 1
6 8 3 0.2
Adding to Dataset
Adding columns based on values of other columns
Remember that we can use existing columns to create new columns. For example, if we know that each entry is in 50mL of solution:
df_clean <- df_clean %>%
mutate(eggs_count = eggs_per_ml * 50) %>%
mutate(dead_animals_count = dead_animals_per_ml * 50)
df_clean# A tibble: 6 × 5
treatment_number eggs_per_ml dead_animals_per_ml eggs_count dead_animals_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 2 0 0.4 0 20
2 4 3.41 0.4 170. 20
3 5 6.82 0.4 341 20
4 6 3.41 0.2 170. 10
5 7 14.4 1 722. 50
6 8 3 0.2 150 10
What if you want to create a new row based on more than one condition?
Let’s you want another column indicating if an environment appears toxic or not. We can combine the case_when() function that works like an if/then/elseif statement with mutate()
In order to write “if Dead animal density (per ml) is greater than 0.5, then toxic, else if Dead animals (per ml) is less than 0.5, then non-toxic:
df_clean <- df_clean %>%
mutate(toxicity = case_when(
dead_animals_per_ml > 0.5 ~ "toxic",
dead_animals_per_ml < 0.5 ~ "non_toxic"
))
df_clean# A tibble: 6 × 6
treatment_number eggs_per_ml dead_animals_per_ml eggs_count dead_animals_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 2 0 0.4 0 20
2 4 3.41 0.4 170. 20
3 5 6.82 0.4 341 20
4 6 3.41 0.2 170. 10
5 7 14.4 1 722. 50
6 8 3 0.2 150 10
# ℹ 1 more variable: toxicity <chr>
Combining datasets
What if you have another dataset you want to integrate into your dataframe? You will first need to identify if there is a column that can act as a key that can accurately combine your datasets. Datasets with matching keys is an important aspect of relational datasets.
Let’s load in and clean another dataset:
df2_raw <- read_excel("cleaning_raw_data/data.xlsx", sheet = "Sheet3")New names:
• `` -> `...2`
df2_raw# A tibble: 10 × 2
`Table 2` ...2
<chr> <chr>
1 treatment number condition
2 1 treatment
3 2 control
4 3 treatment
5 4 control
6 5 treatment
7 6 control
8 7 treatment
9 8 control
10 9 treatment
df2_clean <- df2_raw
# Use first row as column names
names(df2_clean) <- df2_raw %>%
filter(row_number() == 1)
df2_clean# A tibble: 10 × 2
`treatment number` condition
<chr> <chr>
1 treatment number condition
2 1 treatment
3 2 control
4 3 treatment
5 4 control
6 5 treatment
7 6 control
8 7 treatment
9 8 control
10 9 treatment
df2_clean <- df2_clean %>%
filter(row_number() != 1) %>%
clean_names()
df2_clean# A tibble: 9 × 2
treatment_number condition
<chr> <chr>
1 1 treatment
2 2 control
3 3 treatment
4 4 control
5 5 treatment
6 6 control
7 7 treatment
8 8 control
9 9 treatment
Now we can join our datasets. Be sure to inspect your dataset to decide which type of join is appropriate based on the structure of your dataset:
We want an inner-join. If we did a right, left, or full join, we would end up with NAs because we removed treatment 1 from the our first dataset.
df_combined_clean <- inner_join(x = df_clean, y = df2_clean, by = "treatment_number")
df_combined_clean# A tibble: 6 × 7
treatment_number eggs_per_ml dead_animals_per_ml eggs_count dead_animals_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 2 0 0.4 0 20
2 4 3.41 0.4 170. 20
3 5 6.82 0.4 341 20
4 6 3.41 0.2 170. 10
5 7 14.4 1 722. 50
6 8 3 0.2 150 10
# ℹ 2 more variables: toxicity <chr>, condition <chr>