Cleaning

Cleaning up

We’re going to keep using the data from the “Looking at your data” section, but now we’re going to clean it. The big functions we’ll use are:

clean_names(): cleans up column names
mutate(): creates new columns, changes columns (very powerful when used with case_when())
select(): selects columns from a data frame
pivot_longer(): puts the data frame in “long format” (each row is an observation)
rename(): renames columns
filter(): filters data frame

The original data frame

Just to remind ourselves, this is what the original data frame looks like:

sample_n(
  global_catch,
  size = 10
)

# A tibble: 10 × 8
   Entity Code      Year `Artisanal (small-scale commercial)` Discards `Industrial (large-scale commercial)` Recreational Subsistence
   <chr>  <chr>    <dbl>                                <dbl>    <dbl>                                 <dbl>        <dbl>       <dbl>
 1 World  OWID_WRL  1968                             12303383 13871018                              55617156       460158     3916270
 2 World  OWID_WRL  2001                             20046446 11506452                              84201719       769491     3955601
 3 World  OWID_WRL  1955                              9545554  8340179                              18728089       313420     3012167
 4 World  OWID_WRL  1963                             11005246 12332187                              36093501       392172     3826997
 5 World  OWID_WRL  1952                              8272109  7230311                              16463942       293558     2728141
 6 World  OWID_WRL  1960                             10703245 10896146                              25853941       360842     3479410
 7 World  OWID_WRL  1983                             15300068 12388473                              65689361       849021     4200135
 8 World  OWID_WRL  1973                             13688372 12031176                              51308375       548785     3957258
 9 World  OWID_WRL  1974                             13309012 11416434                              56720856       583018     3866687
10 World  OWID_WRL  1997                             20401848 14321828                              88205198       774972     4044958

Making column names nicer

Function: clean_names()
Package: janitor

global_catch %>% 
  # new function: clean_names
  # makes the column names nicer! 
  # compare this with the column names from `colnames(global_catch)` output
  clean_names()

# A tibble: 61 × 8
   entity code      year artisanal_small_scale_commercial discards industrial_large_scale_commercial recreational subsistence
   <chr>  <chr>    <dbl>                            <dbl>    <dbl>                             <dbl>        <dbl>       <dbl>
 1 World  OWID_WRL  1950                          7526795  5874170                          14566338       268260     2677833
 2 World  OWID_WRL  1951                          8278304  6278225                          15417937       284319     2704471
 3 World  OWID_WRL  1952                          8272109  7230311                          16463942       293558     2728141
 4 World  OWID_WRL  1953                          8469284  7172937                          17163789       292070     2753098
 5 World  OWID_WRL  1954                          9226926  8012930                          18340199       304398     2895153
 6 World  OWID_WRL  1955                          9545554  8340179                          18728089       313420     3012167
 7 World  OWID_WRL  1956                         10303408  8692551                          19902604       319333     3058523
 8 World  OWID_WRL  1957                         10425695  8998732                          20030089       339291     3112506
 9 World  OWID_WRL  1958                         10172920  9255992                          20520801       353633     3187292
10 World  OWID_WRL  1959                         10385711  9908003                          23422702       355360     3296620
# ℹ 51 more rows

Creating new columns

Function: mutate()
Package: dplyr (in tidyverse)

In this line of code, we’re calculating catch in million tons.

global_catch %>% 
  clean_names() %>% # same as above

  # new function: mutate
  # create new columns to calculate catch divided by 1000000
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000)

# A tibble: 61 × 11
   entity code      year artisanal_small_scale_commercial discards industrial_large_scale_commercial recreational subsistence artisanal_mil_tons industrial_mil_tons subsistence_mil_tons
   <chr>  <chr>    <dbl>                            <dbl>    <dbl>                             <dbl>        <dbl>       <dbl>              <dbl>               <dbl>                <dbl>
 1 World  OWID_WRL  1950                          7526795  5874170                          14566338       268260     2677833               7.53                14.6                 2.68
 2 World  OWID_WRL  1951                          8278304  6278225                          15417937       284319     2704471               8.28                15.4                 2.70
 3 World  OWID_WRL  1952                          8272109  7230311                          16463942       293558     2728141               8.27                16.5                 2.73
 4 World  OWID_WRL  1953                          8469284  7172937                          17163789       292070     2753098               8.47                17.2                 2.75
 5 World  OWID_WRL  1954                          9226926  8012930                          18340199       304398     2895153               9.23                18.3                 2.90
 6 World  OWID_WRL  1955                          9545554  8340179                          18728089       313420     3012167               9.55                18.7                 3.01
 7 World  OWID_WRL  1956                         10303408  8692551                          19902604       319333     3058523              10.3                 19.9                 3.06
 8 World  OWID_WRL  1957                         10425695  8998732                          20030089       339291     3112506              10.4                 20.0                 3.11
 9 World  OWID_WRL  1958                         10172920  9255992                          20520801       353633     3187292              10.2                 20.5                 3.19
10 World  OWID_WRL  1959                         10385711  9908003                          23422702       355360     3296620              10.4                 23.4                 3.30
# ℹ 51 more rows

Selecting columns of interest

Function: select()
Package: dplyr (in tidyverse)

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% # same as above
  
  # new function: select
  # select the columns you're interested in 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons)

# A tibble: 61 × 4
    year artisanal_mil_tons industrial_mil_tons subsistence_mil_tons
   <dbl>              <dbl>               <dbl>                <dbl>
 1  1950               7.53                14.6                 2.68
 2  1951               8.28                15.4                 2.70
 3  1952               8.27                16.5                 2.73
 4  1953               8.47                17.2                 2.75
 5  1954               9.23                18.3                 2.90
 6  1955               9.55                18.7                 3.01
 7  1956              10.3                 19.9                 3.06
 8  1957              10.4                 20.0                 3.11
 9  1958              10.2                 20.5                 3.19
10  1959              10.4                 23.4                 3.30
# ℹ 51 more rows

Putting the data frame in long format

Function: pivot_longer()
Package: dplyr (in tidyverse)

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% # same as above
  
  # new function: pivot_longer
  # put the data frame in long format: each row is an observation
  # in this case, each row is a fishery with some catch (in million tons) in a given year
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons)

# A tibble: 183 × 3
    year name                 value
   <dbl> <chr>                <dbl>
 1  1950 artisanal_mil_tons    7.53
 2  1950 industrial_mil_tons  14.6 
 3  1950 subsistence_mil_tons  2.68
 4  1951 artisanal_mil_tons    8.28
 5  1951 industrial_mil_tons  15.4 
 6  1951 subsistence_mil_tons  2.70
 7  1952 artisanal_mil_tons    8.27
 8  1952 industrial_mil_tons  16.5 
 9  1952 subsistence_mil_tons  2.73
10  1953 artisanal_mil_tons    8.47
# ℹ 173 more rows

Compare this with the output from the “Creating new columns” section

There are still 3 columns in this data frame, but now there’s a column called name and another called value. Each row is the catch in million tons for a fisher (either artisanal, industrial, or subsistence) in a given year.

Renaming columns

Function: rename()
Package: dplyr (in tidyverse)

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% # same as above
  
  # new function: rename
  # renames columns so that they are easier to understand
  # arguments: "new name" = "old name"
  rename(catch_mil = value,
         fishery_type = name)

# A tibble: 183 × 3
    year fishery_type         catch_mil
   <dbl> <chr>                    <dbl>
 1  1950 artisanal_mil_tons        7.53
 2  1950 industrial_mil_tons      14.6 
 3  1950 subsistence_mil_tons      2.68
 4  1951 artisanal_mil_tons        8.28
 5  1951 industrial_mil_tons      15.4 
 6  1951 subsistence_mil_tons      2.70
 7  1952 artisanal_mil_tons        8.27
 8  1952 industrial_mil_tons      16.5 
 9  1952 subsistence_mil_tons      2.73
10  1953 artisanal_mil_tons        8.47
# ℹ 173 more rows

Creating a new column based on an existing column

Function: case_when() and mutate()
Package: dplyr (in tidyverse)

This creates a new column with the “full name” for each fishery!

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% 
  rename(catch_mil = value,
         fishery_type = name) %>% # same as above
  
  # new function: mutate with case_when
  # creates new column of full names for fisheries
  mutate(fishery_full_name = case_when(
    # if the fishery_type is artisanal, then name it "Artisanal fishery"
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery", 
    # if the fishery_type is industrial, then name it "Industrial fishery"
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    # if the fishery_type is subsistence, then name it "Subsistence fishery"
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  ))

# A tibble: 183 × 4
    year fishery_type         catch_mil fishery_full_name  
   <dbl> <chr>                    <dbl> <chr>              
 1  1950 artisanal_mil_tons        7.53 Artisanal fishery  
 2  1950 industrial_mil_tons      14.6  Industrial fishery 
 3  1950 subsistence_mil_tons      2.68 Subsistence fishery
 4  1951 artisanal_mil_tons        8.28 Artisanal fishery  
 5  1951 industrial_mil_tons      15.4  Industrial fishery 
 6  1951 subsistence_mil_tons      2.70 Subsistence fishery
 7  1952 artisanal_mil_tons        8.27 Artisanal fishery  
 8  1952 industrial_mil_tons      16.5  Industrial fishery 
 9  1952 subsistence_mil_tons      2.73 Subsistence fishery
10  1953 artisanal_mil_tons        8.47 Artisanal fishery  
# ℹ 173 more rows

Filtering the data frame based on what’s in a column

Function: filter()
Package: dplyr (in tidyverse)

This filters the data frame to only include observations after 1980 in the year column.

global_catch %>% 
  clean_names() %>% 
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% 
  rename(catch_mil = value,
         fishery_type = name) %>% 
  mutate(fishery_full_name = case_when(
    # if the fishery_type is artisanal, then name it "Artisanal fishery"
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery", 
    # if the fishery_type is industrial, then name it "Industrial fishery"
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    # if the fishery_type is subsistence, then name it "Subsistence fishery"
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>% # same as above
  
  # new function: filter
  # filters data frame for observations after 1980
  filter(year > 1980)

# A tibble: 90 × 4
    year fishery_type         catch_mil fishery_full_name  
   <dbl> <chr>                    <dbl> <chr>              
 1  1981 artisanal_mil_tons       15.3  Artisanal fishery  
 2  1981 industrial_mil_tons      62.5  Industrial fishery 
 3  1981 subsistence_mil_tons      3.99 Subsistence fishery
 4  1982 artisanal_mil_tons       15.2  Artisanal fishery  
 5  1982 industrial_mil_tons      67.5  Industrial fishery 
 6  1982 subsistence_mil_tons      4.08 Subsistence fishery
 7  1983 artisanal_mil_tons       15.3  Artisanal fishery  
 8  1983 industrial_mil_tons      65.7  Industrial fishery 
 9  1983 subsistence_mil_tons      4.20 Subsistence fishery
10  1984 artisanal_mil_tons       16.1  Artisanal fishery  
# ℹ 80 more rows

Putting it all together

All together to create a data frame called global_catch_clean

global_catch_clean <- global_catch %>% 
  
  # makes the column names nicer
  clean_names() %>% 
  
  # divides catch by 1000000 to calculate catch in million tons
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>% 
  
  # selecting columns
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>% 
  
  # put the data frame in long format: each row is an observation
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>% 
  
  # renames columns so that they are easier to understand
  rename(catch_mil = value,
         fishery_type = name) %>% 
  
  # creates new column of full names for fisheries
  mutate(fishery_full_name = case_when(
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>% 
  
  # filters data frame for observations after 1980
  filter(year > 1980)

And just a quick look at this:

sample_n(global_catch_clean,
         size = 10)

# A tibble: 10 × 4
    year fishery_type         catch_mil fishery_full_name  
   <dbl> <chr>                    <dbl> <chr>              
 1  2010 artisanal_mil_tons       21.6  Artisanal fishery  
 2  2005 industrial_mil_tons      83.8  Industrial fishery 
 3  1991 artisanal_mil_tons       17.8  Artisanal fishery  
 4  2003 industrial_mil_tons      81.0  Industrial fishery 
 5  2000 artisanal_mil_tons       19.7  Artisanal fishery  
 6  1999 subsistence_mil_tons      4.06 Subsistence fishery
 7  1998 subsistence_mil_tons      4.03 Subsistence fishery
 8  1994 subsistence_mil_tons      4.14 Subsistence fishery
 9  2009 industrial_mil_tons      75.7  Industrial fishery 
10  2002 subsistence_mil_tons      3.88 Subsistence fishery