Cleaning

Artwork by Allison Horst

Cleaning up

We’re going to keep using the data from the “Looking at your data” section, but now we’re going to clean it. The big functions we’ll use are:

  • clean_names(): cleans up column names
  • mutate(): creates new columns, changes columns (very powerful when used with case_when())
  • select(): selects columns from a data frame
  • pivot_longer(): puts the data frame in “long format” (each row is an observation)
  • rename(): renames columns
  • filter(): filters data frame

The original data frame

Just to remind ourselves, this is what the original data frame looks like:

sample_n(
  global_catch,
  size = 10
)
# A tibble: 10 × 8
   Entity Code      Year `Artisanal (small-scale commercial)` Discards `Industrial (large-scale commercial)` Recreational Subsistence
   <chr>  <chr>    <dbl>                                <dbl>    <dbl>                                 <dbl>        <dbl>       <dbl>
 1 World  OWID_WRL  1968                             12303383 13871018                              55617156       460158     3916270
 2 World  OWID_WRL  2001                             20046446 11506452                              84201719       769491     3955601
 3 World  OWID_WRL  1955                              9545554  8340179                              18728089       313420     3012167
 4 World  OWID_WRL  1963                             11005246 12332187                              36093501       392172     3826997
 5 World  OWID_WRL  1952                              8272109  7230311                              16463942       293558     2728141
 6 World  OWID_WRL  1960                             10703245 10896146                              25853941       360842     3479410
 7 World  OWID_WRL  1983                             15300068 12388473                              65689361       849021     4200135
 8 World  OWID_WRL  1973                             13688372 12031176                              51308375       548785     3957258
 9 World  OWID_WRL  1974                             13309012 11416434                              56720856       583018     3866687
10 World  OWID_WRL  1997                             20401848 14321828                              88205198       774972     4044958

Making column names nicer

Function: clean_names()
Package: janitor

global_catch %>% 
  # new function: clean_names
  # makes the column names nicer! 
  # compare this with the column names from `colnames(global_catch)` output
  clean_names()
# A tibble: 61 × 8
   entity code      year artisanal_small_scale_commercial discards industrial_large_scale_commercial recreational subsistence
   <chr>  <chr>    <dbl>                            <dbl>    <dbl>                             <dbl>        <dbl>       <dbl>
 1 World  OWID_WRL  1950                          7526795  5874170                          14566338       268260     2677833
 2 World  OWID_WRL  1951                          8278304  6278225                          15417937       284319     2704471
 3 World  OWID_WRL  1952                          8272109  7230311                          16463942       293558     2728141
 4 World  OWID_WRL  1953                          8469284  7172937                          17163789       292070     2753098
 5 World  OWID_WRL  1954                          9226926  8012930                          18340199       304398     2895153
 6 World  OWID_WRL  1955                          9545554  8340179                          18728089       313420     3012167
 7 World  OWID_WRL  1956                         10303408  8692551                          19902604       319333     3058523
 8 World  OWID_WRL  1957                         10425695  8998732                          20030089       339291     3112506
 9 World  OWID_WRL  1958                         10172920  9255992                          20520801       353633     3187292
10 World  OWID_WRL  1959                         10385711  9908003                          23422702       355360     3296620
# ℹ 51 more rows

Creating new columns

Function: mutate()
Package: dplyr (in tidyverse)

In this line of code, we’re calculating catch in million tons.

global_catch %>% 
  clean_names() %>% # same as above

  # new function: mutate
  # create new columns to calculate catch divided by 1000000
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) 
# A tibble: 61 × 11
   entity code      year artisanal_small_scale_commercial discards industrial_large_scale_commercial recreational subsistence artisanal_mil_tons industrial_mil_tons subsistence_mil_tons
   <chr>  <chr>    <dbl>                            <dbl>    <dbl>                             <dbl>        <dbl>       <dbl>              <dbl>               <dbl>                <dbl>
 1 World  OWID_WRL  1950                          7526795  5874170                          14566338       268260     2677833               7.53                14.6                 2.68
 2 World  OWID_WRL  1951                          8278304  6278225                          15417937       284319     2704471               8.28                15.4                 2.70
 3 World  OWID_WRL  1952                          8272109  7230311                          16463942       293558     2728141               8.27                16.5                 2.73
 4 World  OWID_WRL  1953                          8469284  7172937                          17163789       292070     2753098               8.47                17.2                 2.75
 5 World  OWID_WRL  1954                          9226926  8012930                          18340199       304398     2895153               9.23                18.3                 2.90
 6 World  OWID_WRL  1955                          9545554  8340179                          18728089       313420     3012167               9.55                18.7                 3.01
 7 World  OWID_WRL  1956                         10303408  8692551                          19902604       319333     3058523              10.3                 19.9                 3.06
 8 World  OWID_WRL  1957                         10425695  8998732                          20030089       339291     3112506              10.4                 20.0                 3.11
 9 World  OWID_WRL  1958                         10172920  9255992                          20520801       353633     3187292              10.2                 20.5                 3.19
10 World  OWID_WRL  1959                         10385711  9908003                          23422702       355360     3296620              10.4                 23.4                 3.30
# ℹ 51 more rows

Selecting columns of interest

Function: select()
Package: dplyr (in tidyverse)

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% # same as above
  
  # new function: select
  # select the columns you're interested in 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons)
# A tibble: 61 × 4
    year artisanal_mil_tons industrial_mil_tons subsistence_mil_tons
   <dbl>              <dbl>               <dbl>                <dbl>
 1  1950               7.53                14.6                 2.68
 2  1951               8.28                15.4                 2.70
 3  1952               8.27                16.5                 2.73
 4  1953               8.47                17.2                 2.75
 5  1954               9.23                18.3                 2.90
 6  1955               9.55                18.7                 3.01
 7  1956              10.3                 19.9                 3.06
 8  1957              10.4                 20.0                 3.11
 9  1958              10.2                 20.5                 3.19
10  1959              10.4                 23.4                 3.30
# ℹ 51 more rows

Putting the data frame in long format

Function: pivot_longer()
Package: dplyr (in tidyverse)

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% # same as above
  
  # new function: pivot_longer
  # put the data frame in long format: each row is an observation
  # in this case, each row is a fishery with some catch (in million tons) in a given year
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons)
# A tibble: 183 × 3
    year name                 value
   <dbl> <chr>                <dbl>
 1  1950 artisanal_mil_tons    7.53
 2  1950 industrial_mil_tons  14.6 
 3  1950 subsistence_mil_tons  2.68
 4  1951 artisanal_mil_tons    8.28
 5  1951 industrial_mil_tons  15.4 
 6  1951 subsistence_mil_tons  2.70
 7  1952 artisanal_mil_tons    8.27
 8  1952 industrial_mil_tons  16.5 
 9  1952 subsistence_mil_tons  2.73
10  1953 artisanal_mil_tons    8.47
# ℹ 173 more rows
Compare this with the output from the “Creating new columns” section

There are still 3 columns in this data frame, but now there’s a column called name and another called value. Each row is the catch in million tons for a fisher (either artisanal, industrial, or subsistence) in a given year.

Renaming columns

Function: rename()
Package: dplyr (in tidyverse)

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% # same as above
  
  # new function: rename
  # renames columns so that they are easier to understand
  # arguments: "new name" = "old name"
  rename(catch_mil = value,
         fishery_type = name) 
# A tibble: 183 × 3
    year fishery_type         catch_mil
   <dbl> <chr>                    <dbl>
 1  1950 artisanal_mil_tons        7.53
 2  1950 industrial_mil_tons      14.6 
 3  1950 subsistence_mil_tons      2.68
 4  1951 artisanal_mil_tons        8.28
 5  1951 industrial_mil_tons      15.4 
 6  1951 subsistence_mil_tons      2.70
 7  1952 artisanal_mil_tons        8.27
 8  1952 industrial_mil_tons      16.5 
 9  1952 subsistence_mil_tons      2.73
10  1953 artisanal_mil_tons        8.47
# ℹ 173 more rows

Creating a new column based on an existing column

Function: case_when() and mutate()
Package: dplyr (in tidyverse)

This creates a new column with the “full name” for each fishery!

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% 
  rename(catch_mil = value,
         fishery_type = name) %>% # same as above
  
  # new function: mutate with case_when
  # creates new column of full names for fisheries
  mutate(fishery_full_name = case_when(
    # if the fishery_type is artisanal, then name it "Artisanal fishery"
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery", 
    # if the fishery_type is industrial, then name it "Industrial fishery"
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    # if the fishery_type is subsistence, then name it "Subsistence fishery"
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  ))
# A tibble: 183 × 4
    year fishery_type         catch_mil fishery_full_name  
   <dbl> <chr>                    <dbl> <chr>              
 1  1950 artisanal_mil_tons        7.53 Artisanal fishery  
 2  1950 industrial_mil_tons      14.6  Industrial fishery 
 3  1950 subsistence_mil_tons      2.68 Subsistence fishery
 4  1951 artisanal_mil_tons        8.28 Artisanal fishery  
 5  1951 industrial_mil_tons      15.4  Industrial fishery 
 6  1951 subsistence_mil_tons      2.70 Subsistence fishery
 7  1952 artisanal_mil_tons        8.27 Artisanal fishery  
 8  1952 industrial_mil_tons      16.5  Industrial fishery 
 9  1952 subsistence_mil_tons      2.73 Subsistence fishery
10  1953 artisanal_mil_tons        8.47 Artisanal fishery  
# ℹ 173 more rows

Filtering the data frame based on what’s in a column

Function: filter()
Package: dplyr (in tidyverse)

This filters the data frame to only include observations after 1980 in the year column.

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% 
  rename(catch_mil = value,
         fishery_type = name) %>% 
  mutate(fishery_full_name = case_when(
    # if the fishery_type is artisanal, then name it "Artisanal fishery"
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery", 
    # if the fishery_type is industrial, then name it "Industrial fishery"
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    # if the fishery_type is subsistence, then name it "Subsistence fishery"
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>% # same as above
  
  # new function: filter
  # filters data frame for observations after 1980
  filter(year > 1980)
# A tibble: 90 × 4
    year fishery_type         catch_mil fishery_full_name  
   <dbl> <chr>                    <dbl> <chr>              
 1  1981 artisanal_mil_tons       15.3  Artisanal fishery  
 2  1981 industrial_mil_tons      62.5  Industrial fishery 
 3  1981 subsistence_mil_tons      3.99 Subsistence fishery
 4  1982 artisanal_mil_tons       15.2  Artisanal fishery  
 5  1982 industrial_mil_tons      67.5  Industrial fishery 
 6  1982 subsistence_mil_tons      4.08 Subsistence fishery
 7  1983 artisanal_mil_tons       15.3  Artisanal fishery  
 8  1983 industrial_mil_tons      65.7  Industrial fishery 
 9  1983 subsistence_mil_tons      4.20 Subsistence fishery
10  1984 artisanal_mil_tons       16.1  Artisanal fishery  
# ℹ 80 more rows

Putting it all together

All together to create a data frame called global_catch_clean

global_catch_clean <- global_catch %>% 
  
  # makes the column names nicer
  clean_names() %>% 
  
  # divides catch by 1000000 to calculate catch in million tons
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  
  # selecting columns
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  
  # put the data frame in long format: each row is an observation
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% 
  
  # renames columns so that they are easier to understand
  rename(catch_mil = value,
         fishery_type = name) %>% 
  
  # creates new column of full names for fisheries
  mutate(fishery_full_name = case_when(
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>% 
  
  # filters data frame for observations after 1980
  filter(year > 1980)

And just a quick look at this:

sample_n(global_catch_clean,
         size = 10)
# A tibble: 10 × 4
    year fishery_type         catch_mil fishery_full_name  
   <dbl> <chr>                    <dbl> <chr>              
 1  2010 artisanal_mil_tons       21.6  Artisanal fishery  
 2  2005 industrial_mil_tons      83.8  Industrial fishery 
 3  1991 artisanal_mil_tons       17.8  Artisanal fishery  
 4  2003 industrial_mil_tons      81.0  Industrial fishery 
 5  2000 artisanal_mil_tons       19.7  Artisanal fishery  
 6  1999 subsistence_mil_tons      4.06 Subsistence fishery
 7  1998 subsistence_mil_tons      4.03 Subsistence fishery
 8  1994 subsistence_mil_tons      4.14 Subsistence fishery
 9  2009 industrial_mil_tons      75.7  Industrial fishery 
10  2002 subsistence_mil_tons      3.88 Subsistence fishery