Cleaning Data Tips

Authors

Charles Lehnen

Cameron Egan

Charles’ Data Cleaning Tips


Principles

  • Remember the 2 steps of environment set-up:
  1. Clear your environment

    • This ensures that when you re-run the whole script, data and dataframes from other scripts will not interfere with your current script.
rm(list = ls())
  1. Load your packages
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
  • Don’t modify your file (.csv, .rds, .RData, .xlsx, etc.) directly.

    • It should be kept in its raw form. This is so that your workflow in R is reproducible by someone else if they download the same dataset. Instead, modify the contents through your data cleaning in R.
  • When importing your data, import the raw data as its own data frame that you do not modify further.

    • When cleaning, assign that to a new variable. That way when you re-run all of your code, if you modified the data frame throughout, it will always start your analysis with the raw data like it should.
df_raw <- data.frame()

df_clean <- df_raw %>%
  filter() %>%
  select()
  • When naming your own columns, avoid spaces. Use underscores ( _ ) instead

  • When naming your own files or folders, avoid spaces. Use underscores ( _ ) instead

  • Remember to be an explicit programmer when naming objects!

    • Throughout this handout, I use object generic names like df_raw and df_clean. For your actual datasets you should give them meaningful names instead of df that help describe the data!
  • Remember the 3 steps of preparing data:

    1. Import data

    2. Inspect data

    3. Clean data


Importing data

Can’t find your file?


df_raw <- read.csv(file = "data.csv")

If you receive an error like:

Error in file(file, "rt") : cannot open the connection

This means the the function cannot locate your file

  1. Step 1: Ensure the File is in the Correct Location

    • If you use the code below, This tells R to look for the "data.csv" file in the same folder as your script. Make sure the file is in the correct folder:
df_raw <- read.csv(file = "data.csv")

Note, if you want to explicitly indicate the current folder where the .qmd file is located, you can use ./:

df_raw <- read.csv(file = "./data.csv")

This is functionally equivalent to omitting ./, as R assumes the current folder by default.

Step 2: Specify a Different Folder

First, check the location of your working directory, which is the folder your script is currently looking for files in:

getwd()
[1] "C:/Users/Charles/Documents/GitHub/BISC_404_Ecology_and_Biodiversity/Labs/Lab_3"

Note: if this doesn’t match your expectations, you can use setwd() to set the correct folder

If your "data.csv" file is not in the same folder as your script file, you can direct R to look in another folder. For example:

  • If your script file is in a folder called Website_files, and the "data.csv" file is in a subfolder called cleaning_raw_data, you can specify the path like this:
data_raw <- read.csv(file = "cleaning_raw_data/data.csv")
  • If your "data.csv" file is in the parent folder (the folder that contains Lab_3), use ../ to indicate “go up one folder”:
data_raw <- read.csv(file = "../cleaning_raw_data/data.csv")

Key Takeaways

  • Use ./ to indicate the current folder (though it is usually implied by default).
  • Use ../ to move up a folder in the directory hierarchy.
  • Use <subfolder_name>/ to specify a subfolder within the current folder.

Double-check your file’s location and path to ensure successful loading

Importing R Data Files

`.Rds` and `.Rda` files in R

.rds and .rda (or .Rdata) files are used to store data natively in R.

The main advantage of .rda files is that they can contain a variety of data in a single file (objects, plots, functions, packages, etc.), however when loading a .rda file you run the risk of over-writing existing objects, plots, etc. with the same name.

.rds files can only contain a single data structure. A benefit of this file type is that loading a .rds file will not overwrite data structures with the same name.

df_raw <- readRDS(file = "cleaning_raw_data/data.rds")

df_raw
# A tibble: 7 × 3
  `eggs (per ml)` `dead animals (per ml)` `external medium (mu mol N / l)`
            <dbl>                   <dbl>                            <dbl>
1           NA                       NA                                 80
2            0                        0.4                               80
3            7.02                     0                                 80
4            3.41                     0.4                               80
5            6.82                     0.4                               80
6            3.41                     0.2                               80
7           14.4                      1                                 80

Importing a CSV

data_raw <- read.csv(file = "cleaning_raw_data/data.csv")

data_raw
  eggs..per.ml. dead.animals..per.ml. external.medium..mu.mol.N...l.
1            NA                    NA                             80
2          0.00                   0.4                             80
3          7.02                   0.0                             80
4          3.41                   0.4                             80
5          6.82                   0.4                             80
6          3.41                   0.2                             80
7         14.45                   1.0                             80

Odd column name formatting from importing your .csv?

library(readr)  

df_raw <- read_csv(file = "cleaning_raw_data/data.csv") 
Rows: 7 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): eggs (per ml), dead animals (per ml), external medium (mu mol N / l)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_raw
# A tibble: 7 × 3
  `eggs (per ml)` `dead animals (per ml)` `external medium (mu mol N / l)`
            <dbl>                   <dbl>                            <dbl>
1           NA                       NA                                 80
2            0                        0.4                               80
3            7.02                     0                                 80
4            3.41                     0.4                               80
5            6.82                     0.4                               80
6            3.41                     0.2                               80
7           14.4                      1                                 80

Importing CSV Troubleshooting

df_raw <- read_csv(file = "cleaning_raw_data/data.csv.txt")
Rows: 9 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): eggs (per ml);dead animals (per ml);external medium (mu mol N / l)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_raw
# A tibble: 9 × 1
  `eggs (per ml);dead animals (per ml);external medium (mu mol N / l)`
  <chr>                                                               
1 NA;NA;80                                                            
2 0;0.4;80                                                            
3 7.02;0;80                                                           
4 3.41;0.4;80                                                         
5 6.82;0.4;80                                                         
6 3.41;0.2;80                                                         
7 14.45;1;80                                                          
8 22;2;NA                                                             
9 0;-16;80                                                            

Why is this .csv file not loading correctly?

Check the delimiters. Remember that .csv stands for “comma-separated file,” but this file looks like it has semi-colons (;) as separators. We can use read_delim() for ultimate flexibility.

library(readr)

df_raw <- read_delim(file = "cleaning_raw_data/data.csv.txt", delim = ";")
Rows: 9 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (3): eggs (per ml), dead animals (per ml), external medium (mu mol N / l)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_raw
# A tibble: 9 × 3
  `eggs (per ml)` `dead animals (per ml)` `external medium (mu mol N / l)`
            <dbl>                   <dbl>                            <dbl>
1           NA                       NA                                 80
2            0                        0.4                               80
3            7.02                     0                                 80
4            3.41                     0.4                               80
5            6.82                     0.4                               80
6            3.41                     0.2                               80
7           14.4                      1                                 80
8           22                        2                                 NA
9            0                      -16                                 80

Importing an Excel file

library(readxl)

df_raw <- read_excel("cleaning_raw_data/data.xlsx")

df_raw
# A tibble: 0 × 1
# ℹ 1 variable: blank <lgl>

Data missing? Check the sheet

# Either by number
df_raw <- read_excel("cleaning_raw_data/data.xlsx", sheet = 2)
New names:
• `` -> `...1`
• `` -> `...3`
• `` -> `...4`
df_raw
# A tibble: 10 × 4
   ...1             `Table 1`     ...3                  ...4                    
   <chr>            <chr>         <chr>                 <chr>                   
 1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol…
 2 1                NA            NA                    80                      
 3 2                0             0.4                   80                      
 4 3                7.02          0                     80                      
 5 4                3.41          0.4                   80                      
 6 5                6.82          0.4                   80                      
 7 6                3.41          0.2                   80                      
 8 7                14.45         1                     80                      
 9 8                3             0.2                   NA                      
10 9                4.3           -16                   80                      
# Or by name

df_raw <- read_excel("cleaning_raw_data/data.xlsx", sheet = "Sheet2")
New names:
• `` -> `...1`
• `` -> `...3`
• `` -> `...4`
df_raw
# A tibble: 10 × 4
   ...1             `Table 1`     ...3                  ...4                    
   <chr>            <chr>         <chr>                 <chr>                   
 1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol…
 2 1                NA            NA                    80                      
 3 2                0             0.4                   80                      
 4 3                7.02          0                     80                      
 5 4                3.41          0.4                   80                      
 6 5                6.82          0.4                   80                      
 7 6                3.41          0.2                   80                      
 8 7                14.45         1                     80                      
 9 8                3             0.2                   NA                      
10 9                4.3           -16                   80                      

Exploring Data

There are many ways to explore your raw data! Here are some examples:

head(df_raw)
# A tibble: 6 × 4
  ...1             `Table 1`     ...3                  ...4                     
  <chr>            <chr>         <chr>                 <chr>                    
1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol …
2 1                NA            NA                    80                       
3 2                0             0.4                   80                       
4 3                7.02          0                     80                       
5 4                3.41          0.4                   80                       
6 5                6.82          0.4                   80                       
summary(df_raw)
     ...1             Table 1              ...3               ...4          
 Length:10          Length:10          Length:10          Length:10         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
names(df_raw)
[1] "...1"    "Table 1" "...3"    "...4"   
dim(df_raw)
[1] 10  4
str(df_raw)
tibble [10 × 4] (S3: tbl_df/tbl/data.frame)
 $ ...1   : chr [1:10] "treatment number" "1" "2" "3" ...
 $ Table 1: chr [1:10] "eggs (per ml)" "NA" "0" "7.02" ...
 $ ...3   : chr [1:10] "dead animals (per ml)" "NA" "0.4" "0" ...
 $ ...4   : chr [1:10] "external medium (mu mol N / l)" "80" "80" "80" ...

And here are some rough, exploratory plot examples. Note I am just using them to explore the raw data, so did not include labels like you should in your real analysis.

ggplot(data = df_raw, aes(x = as.numeric(`Table 1`), y = as.numeric(`...3`))) +
  geom_point() +
  labs(x = "Eggs (per ml)", y = "Dead animals (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(data = df_raw, aes(x = as.numeric(`Table 1`), y = as.numeric(`...3`))) +
  geom_point() +
  geom_smooth() +
  labs(x = "Eggs (per ml)", y = "Dead animals (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(data = df_raw, aes(x = as.numeric(`Table 1`))) +
  geom_histogram() +
  labs(x = "Eggs (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

ggplot(data = df_raw, aes(x = as.numeric(`...3`))) +
  geom_histogram() +
  labs(x = "Dead animals (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

ggplot(data = df_raw, aes(x = seq_along(as.numeric(`Table 1`)), y = as.numeric(`Table 1`))) +
  geom_line() +
  labs(y = "Eggs (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

ggplot(data = df_raw, aes(x = seq_along(as.numeric(`...3`)), y = as.numeric(`...3`))) +
  geom_line() +
  labs(y = "Dead animals (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

ggplot(data = df_raw, aes(x = "", y = as.numeric(`Table 1`))) +
  geom_boxplot() +
  geom_jitter() +
  labs(y = "Eggs (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(data = df_raw, aes(x = "", y = as.numeric(`...3`))) +
  geom_boxplot() +
  geom_jitter() +
  labs(y = "Dead animals (per ml)")
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning in FUN(X[[i]], ...): NAs introduced by coercion
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

But your data exploration does not stop at running these functions! You now must critically look at the results and interpret what they mean. Then clearly describe what you are seeing to yourself.


Cleaning Data

Cleaning column names

Now how do you reference these column names? With backticks ( ` ) like we did above:

ggplot(data = df_raw, aes(y = `Table 1`)) +
  geom_bar()

Your titles are in the incorrect row because of spreadsheet formatting?

df_clean <- df_raw

df_clean
# A tibble: 10 × 4
   ...1             `Table 1`     ...3                  ...4                    
   <chr>            <chr>         <chr>                 <chr>                   
 1 treatment number eggs (per ml) dead animals (per ml) external medium (mu mol…
 2 1                NA            NA                    80                      
 3 2                0             0.4                   80                      
 4 3                7.02          0                     80                      
 5 4                3.41          0.4                   80                      
 6 5                6.82          0.4                   80                      
 7 6                3.41          0.2                   80                      
 8 7                14.45         1                     80                      
 9 8                3             0.2                   NA                      
10 9                4.3           -16                   80                      
# Use first row as column names
names(df_clean) <- df_raw %>%
  filter(row_number() == 1)
Warning: The `value` argument of `names<-()` must be a character vector as of tibble
3.0.0.
df_clean
# A tibble: 10 × 4
   `treatment number` `eggs (per ml)` `dead animals (per ml)`
   <chr>              <chr>           <chr>                  
 1 treatment number   eggs (per ml)   dead animals (per ml)  
 2 1                  NA              NA                     
 3 2                  0               0.4                    
 4 3                  7.02            0                      
 5 4                  3.41            0.4                    
 6 5                  6.82            0.4                    
 7 6                  3.41            0.2                    
 8 7                  14.45           1                      
 9 8                  3               0.2                    
10 9                  4.3             -16                    
# ℹ 1 more variable: `external medium (mu mol N / l)` <chr>
# Filter out first row 
df_clean <- df_clean %>%
  filter(row_number() != 1) 

df_clean
# A tibble: 9 × 4
  `treatment number` `eggs (per ml)` `dead animals (per ml)`
  <chr>              <chr>           <chr>                  
1 1                  NA              NA                     
2 2                  0               0.4                    
3 3                  7.02            0                      
4 4                  3.41            0.4                    
5 5                  6.82            0.4                    
6 6                  3.41            0.2                    
7 7                  14.45           1                      
8 8                  3               0.2                    
9 9                  4.3             -16                    
# ℹ 1 more variable: `external medium (mu mol N / l)` <chr>

How do you replace these spaces with underscores to make the columns easier to works with?

  1. Option 1: One at a time (manually):
df_clean1 <- df_clean %>%
  rename(eggs_per_ml = `eggs (per ml)`)

df_clean1
# A tibble: 9 × 4
  `treatment number` eggs_per_ml `dead animals (per ml)` external medium (mu m…¹
  <chr>              <chr>       <chr>                   <chr>                  
1 1                  NA          NA                      80                     
2 2                  0           0.4                     80                     
3 3                  7.02        0                       80                     
4 4                  3.41        0.4                     80                     
5 5                  6.82        0.4                     80                     
6 6                  3.41        0.2                     80                     
7 7                  14.45       1                       80                     
8 8                  3           0.2                     NA                     
9 9                  4.3         -16                     80                     
# ℹ abbreviated name: ¹​`external medium (mu mol N / l)`

We also need to get rid of the other troublesome punctuation, like the parentheses, because they will mess with our functions. Imagine if we tried to run the following. How would R try to interpret that?

df_clean <- df_raw %>%
  rename(external_medium_(mu_mol_N_/_l) = `external medium (mu mol N / l)`)
Error in parse(text = input): <text>:2:38: unexpected symbol
1: df_clean <- df_raw %>%
2:   rename(external_medium_(mu_mol_N_/_l
                                        ^
  1. Option 2: Among other ways to accomplish removing all troublesome punctuation, here are some options:
df_clean2 <- df_clean %>%
  rename_with(~ gsub("[^A-Za-z0-9]", "_", .x))

df_clean2
# A tibble: 9 × 4
  treatment_number eggs__per_ml_ dead_animals__per_ml_ external_medium__mu_mol…¹
  <chr>            <chr>         <chr>                 <chr>                    
1 1                NA            NA                    80                       
2 2                0             0.4                   80                       
3 3                7.02          0                     80                       
4 4                3.41          0.4                   80                       
5 5                6.82          0.4                   80                       
6 6                3.41          0.2                   80                       
7 7                14.45         1                     80                       
8 8                3             0.2                   NA                       
9 9                4.3           -16                   80                       
# ℹ abbreviated name: ¹​external_medium__mu_mol_N___l_
library(stringr)

df_clean3 <- df_clean %>%
  rename_with(~ str_replace_all(.x, "[^A-Za-z0-9]", "_")) 

df_clean3
# A tibble: 9 × 4
  treatment_number eggs__per_ml_ dead_animals__per_ml_ external_medium__mu_mol…¹
  <chr>            <chr>         <chr>                 <chr>                    
1 1                NA            NA                    80                       
2 2                0             0.4                   80                       
3 3                7.02          0                     80                       
4 4                3.41          0.4                   80                       
5 5                6.82          0.4                   80                       
6 6                3.41          0.2                   80                       
7 7                14.45         1                     80                       
8 8                3             0.2                   NA                       
9 9                4.3           -16                   80                       
# ℹ abbreviated name: ¹​external_medium__mu_mol_N___l_
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
df_clean <- df_clean %>%
  clean_names() 

df_clean
# A tibble: 9 × 4
  treatment_number eggs_per_ml dead_animals_per_ml external_medium_mu_mol_n_l
  <chr>            <chr>       <chr>               <chr>                     
1 1                NA          NA                  80                        
2 2                0           0.4                 80                        
3 3                7.02        0                   80                        
4 4                3.41        0.4                 80                        
5 5                6.82        0.4                 80                        
6 6                3.41        0.2                 80                        
7 7                14.45       1                   80                        
8 8                3           0.2                 NA                        
9 9                4.3         -16                 80                        

Assorted cleaning

As you explore your dataset, keep an eye out for potential red flags that may indicate data errors. Here are some key issues to check for:

  • NAs: NA (not applicable) or NaN (not a number) values

  • Outliers: Values that are far from the mean may indicate errors. Check using summary(), scatter plots, histograms.

  • Numbers that don’t make sense:

    • Can there be negative counts of animals? Probably not!

    • If your study covers 2012-2017, but there’s a data point from 1900, it’s likely a data error.

    • If your study is in North America, but you see a point at lat/lon (0,0), it could be a coordinate entry error.

    • Percentages over 100%

    • Etc.

  • Incorrect Data Types: Remember that incorrect datatypes can cause analysis errors, check the datatypes of every column to make sure they are what you want

  • Dates in Incorrect Format: Use the lubridate package’s formulas to convert to date data object. For example, if input data in format of month/day/year then use mutate(<new date column name> = mdy(<existing date column name>)) in your cleaning pipe

What red flags do you see from our current dataset?

summary(df_clean)
 treatment_number   eggs_per_ml        dead_animals_per_ml
 Length:9           Length:9           Length:9           
 Class :character   Class :character   Class :character   
 Mode  :character   Mode  :character   Mode  :character   
 external_medium_mu_mol_n_l
 Length:9                  
 Class :character          
 Mode  :character          

Should these be character/string data types?

df_clean <- df_clean %>%
  mutate(eggs_per_ml = as.numeric(eggs_per_ml)) %>%
  mutate(dead_animals_per_ml = as.numeric(dead_animals_per_ml)) %>%
  mutate(external_medium_mu_mol_n_l = as.numeric(external_medium_mu_mol_n_l))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `eggs_per_ml = as.numeric(eggs_per_ml)`.
Caused by warning:
! NAs introduced by coercion
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `dead_animals_per_ml = as.numeric(dead_animals_per_ml)`.
Caused by warning:
! NAs introduced by coercion
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `external_medium_mu_mol_n_l =
  as.numeric(external_medium_mu_mol_n_l)`.
Caused by warning:
! NAs introduced by coercion
summary(df_clean)
 treatment_number    eggs_per_ml     dead_animals_per_ml
 Length:9           Min.   : 0.000   Min.   :-16.000    
 Class :character   1st Qu.: 3.308   1st Qu.:  0.150    
 Mode  :character   Median : 3.855   Median :  0.300    
                    Mean   : 5.301   Mean   : -1.675    
                    3rd Qu.: 6.870   3rd Qu.:  0.400    
                    Max.   :14.450   Max.   :  1.000    
                    NA's   :1        NA's   :1          
 external_medium_mu_mol_n_l
 Min.   :80                
 1st Qu.:80                
 Median :80                
 Mean   :80                
 3rd Qu.:80                
 Max.   :80                
 NA's   :1                 

Does -16 make sense for a value for density of dead animals?

df_clean <- df_clean %>%
  filter(dead_animals_per_ml > 0)


summary(df_clean)
 treatment_number    eggs_per_ml     dead_animals_per_ml
 Length:6           Min.   : 0.000   Min.   :0.2000     
 Class :character   1st Qu.: 3.103   1st Qu.:0.2500     
 Mode  :character   Median : 3.410   Median :0.4000     
                    Mean   : 5.182   Mean   :0.4333     
                    3rd Qu.: 5.968   3rd Qu.:0.4000     
                    Max.   :14.450   Max.   :1.0000     
                                                        
 external_medium_mu_mol_n_l
 Min.   :80                
 1st Qu.:80                
 Median :80                
 Mean   :80                
 3rd Qu.:80                
 Max.   :80                
 NA's   :1                 
Removing NAs
  1. First check for NAs
summary(df_clean)
 treatment_number    eggs_per_ml     dead_animals_per_ml
 Length:6           Min.   : 0.000   Min.   :0.2000     
 Class :character   1st Qu.: 3.103   1st Qu.:0.2500     
 Mode  :character   Median : 3.410   Median :0.4000     
                    Mean   : 5.182   Mean   :0.4333     
                    3rd Qu.: 5.968   3rd Qu.:0.4000     
                    Max.   :14.450   Max.   :1.0000     
                                                        
 external_medium_mu_mol_n_l
 Min.   :80                
 1st Qu.:80                
 Median :80                
 Mean   :80                
 3rd Qu.:80                
 Max.   :80                
 NA's   :1                 
  1. Then you need to decide:
  • Option 1) Are you going to remove allrows that have NAs?
    • Do this only if this will not eliminate data we are interested in
library(tidyr)

df_clean1 <- df_clean %>%
  clean_names() %>%
  drop_na()

df_clean1
# A tibble: 5 × 4
  treatment_number eggs_per_ml dead_animals_per_ml external_medium_mu_mol_n_l
  <chr>                  <dbl>               <dbl>                      <dbl>
1 2                       0                    0.4                         80
2 4                       3.41                 0.4                         80
3 5                       6.82                 0.4                         80
4 6                       3.41                 0.2                         80
5 7                      14.4                  1                           80

Note: this was not the right choice in our case because we lost the egg and dead animal density data we are interested in from this column because of the NA in the media column! As long as we assume this NA was a data error and it had the same amount of media as the rest, we should preserve that data for our analysis.

  • Option 2) Are you going to remove certain rows that have NAs?

    • Choose this if you want to specify each individual column you are interested in
# If we are only interested in analyzing the "eggs_per_ml" and "dead_animals_per_ml" columns
df_clean2 <- df_clean %>%
  clean_names() %>%
  filter(!is.na(eggs_per_ml)) %>%
  filter(!is.na(dead_animals_per_ml))

df_clean2
# A tibble: 6 × 4
  treatment_number eggs_per_ml dead_animals_per_ml external_medium_mu_mol_n_l
  <chr>                  <dbl>               <dbl>                      <dbl>
1 2                       0                    0.4                         80
2 4                       3.41                 0.4                         80
3 5                       6.82                 0.4                         80
4 6                       3.41                 0.2                         80
5 7                      14.4                  1                           80
6 8                       3                    0.2                         NA
  • Option 3) Are you going to remove columns that have NAs
    • This is a safe option to choose certain columns are not important to your analysis
# Remove the column we are not interested in
df_clean <- df_clean %>%
  select(-external_medium_mu_mol_n_l)

df_clean
# A tibble: 6 × 3
  treatment_number eggs_per_ml dead_animals_per_ml
  <chr>                  <dbl>               <dbl>
1 2                       0                    0.4
2 4                       3.41                 0.4
3 5                       6.82                 0.4
4 6                       3.41                 0.2
5 7                      14.4                  1  
6 8                       3                    0.2
Adding to Dataset
Adding columns based on values of other columns

Remember that we can use existing columns to create new columns. For example, if we know that each entry is in 50mL of solution:

df_clean <- df_clean %>%
  mutate(eggs_count = eggs_per_ml * 50) %>% 
  mutate(dead_animals_count = dead_animals_per_ml * 50) 

df_clean
# A tibble: 6 × 5
  treatment_number eggs_per_ml dead_animals_per_ml eggs_count dead_animals_count
  <chr>                  <dbl>               <dbl>      <dbl>              <dbl>
1 2                       0                    0.4         0                  20
2 4                       3.41                 0.4       170.                 20
3 5                       6.82                 0.4       341                  20
4 6                       3.41                 0.2       170.                 10
5 7                      14.4                  1         722.                 50
6 8                       3                    0.2       150                  10

What if you want to create a new row based on more than one condition?

Let’s you want another column indicating if an environment appears toxic or not. We can combine the case_when() function that works like an if/then/elseif statement with mutate()

In order to write “if Dead animal density (per ml) is greater than 0.5, then toxic, else if Dead animals (per ml) is less than 0.5, then non-toxic:

df_clean <- df_clean %>%
  mutate(toxicity = case_when(
    dead_animals_per_ml > 0.5 ~ "toxic",
    dead_animals_per_ml < 0.5 ~ "non_toxic"
  ))

df_clean
# A tibble: 6 × 6
  treatment_number eggs_per_ml dead_animals_per_ml eggs_count dead_animals_count
  <chr>                  <dbl>               <dbl>      <dbl>              <dbl>
1 2                       0                    0.4         0                  20
2 4                       3.41                 0.4       170.                 20
3 5                       6.82                 0.4       341                  20
4 6                       3.41                 0.2       170.                 10
5 7                      14.4                  1         722.                 50
6 8                       3                    0.2       150                  10
# ℹ 1 more variable: toxicity <chr>
Combining datasets

What if you have another dataset you want to integrate into your dataframe? You will first need to identify if there is a column that can act as a key that can accurately combine your datasets. Datasets with matching keys is an important aspect of relational datasets.

Let’s load in and clean another dataset:

df2_raw <- read_excel("cleaning_raw_data/data.xlsx", sheet = "Sheet3")
New names:
• `` -> `...2`
df2_raw
# A tibble: 10 × 2
   `Table 2`        ...2     
   <chr>            <chr>    
 1 treatment number condition
 2 1                treatment
 3 2                control  
 4 3                treatment
 5 4                control  
 6 5                treatment
 7 6                control  
 8 7                treatment
 9 8                control  
10 9                treatment
df2_clean <- df2_raw

# Use first row as column names
names(df2_clean) <- df2_raw %>%
  filter(row_number() == 1)
  
df2_clean
# A tibble: 10 × 2
   `treatment number` condition
   <chr>              <chr>    
 1 treatment number   condition
 2 1                  treatment
 3 2                  control  
 4 3                  treatment
 5 4                  control  
 6 5                  treatment
 7 6                  control  
 8 7                  treatment
 9 8                  control  
10 9                  treatment
df2_clean <- df2_clean %>%
  filter(row_number() != 1) %>%
  clean_names() 

df2_clean
# A tibble: 9 × 2
  treatment_number condition
  <chr>            <chr>    
1 1                treatment
2 2                control  
3 3                treatment
4 4                control  
5 5                treatment
6 6                control  
7 7                treatment
8 8                control  
9 9                treatment

Now we can join our datasets. Be sure to inspect your dataset to decide which type of join is appropriate based on the structure of your dataset:

We want an inner-join. If we did a right, left, or full join, we would end up with NAs because we removed treatment 1 from the our first dataset.

df_combined_clean <- inner_join(x = df_clean, y = df2_clean, by = "treatment_number")

df_combined_clean
# A tibble: 6 × 7
  treatment_number eggs_per_ml dead_animals_per_ml eggs_count dead_animals_count
  <chr>                  <dbl>               <dbl>      <dbl>              <dbl>
1 2                       0                    0.4         0                  20
2 4                       3.41                 0.4       170.                 20
3 5                       6.82                 0.4       341                  20
4 6                       3.41                 0.2       170.                 10
5 7                      14.4                  1         722.                 50
6 8                       3                    0.2       150                  10
# ℹ 2 more variables: toxicity <chr>, condition <chr>