We’re going to keep using the data from the “Looking at your data” section, but now we’re going to clean it. The big functions we’ll use are:
clean_names(): cleans up column names
mutate(): creates new columns, changes columns (very powerful when used with case_when())
select(): selects columns from a data frame
pivot_longer(): puts the data frame in “long format” (each row is an observation)
rename(): renames columns
filter(): keeps only the rows of a data frame that meet a condition
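All of the functions listed above come from two packages, so both need to be loaded before any code in this section will run. A minimal setup sketch, assuming tidyverse and janitor are already installed on your computer:

```r
# load the packages used in this section
# (assumes both were installed beforehand, e.g. with install.packages())
library(tidyverse) # dplyr: mutate, select, rename, filter, case_when; tidyr: pivot_longer
library(janitor)   # clean_names
```

If either library() call fails with a "there is no package called" error, the package has not been installed yet.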
The original data frame
Just to remind ourselves, this is what the original data frame looks like:
sample_n(global_catch, size = 10)
# A tibble: 10 × 8
Entity Code Year `Artisanal (small-scale commercial)` Discards `Industrial (large-scale commercial)` Recreational Subsistence
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 World OWID_WRL 1968 12303383 13871018 55617156 460158 3916270
2 World OWID_WRL 2001 20046446 11506452 84201719 769491 3955601
3 World OWID_WRL 1955 9545554 8340179 18728089 313420 3012167
4 World OWID_WRL 1963 11005246 12332187 36093501 392172 3826997
5 World OWID_WRL 1952 8272109 7230311 16463942 293558 2728141
6 World OWID_WRL 1960 10703245 10896146 25853941 360842 3479410
7 World OWID_WRL 1983 15300068 12388473 65689361 849021 4200135
8 World OWID_WRL 1973 13688372 12031176 51308375 548785 3957258
9 World OWID_WRL 1974 13309012 11416434 56720856 583018 3866687
10 World OWID_WRL 1997 20401848 14321828 88205198 774972 4044958
Making column names nicer
Function: clean_names() Package: janitor
global_catch %>%
  # new function: clean_names
  # makes the column names nicer!
  # compare this with the column names from `colnames(global_catch)` output
  clean_names()
# A tibble: 61 × 8
entity code year artisanal_small_scale_commercial discards industrial_large_scale_commercial recreational subsistence
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 World OWID_WRL 1950 7526795 5874170 14566338 268260 2677833
2 World OWID_WRL 1951 8278304 6278225 15417937 284319 2704471
3 World OWID_WRL 1952 8272109 7230311 16463942 293558 2728141
4 World OWID_WRL 1953 8469284 7172937 17163789 292070 2753098
5 World OWID_WRL 1954 9226926 8012930 18340199 304398 2895153
6 World OWID_WRL 1955 9545554 8340179 18728089 313420 3012167
7 World OWID_WRL 1956 10303408 8692551 19902604 319333 3058523
8 World OWID_WRL 1957 10425695 8998732 20030089 339291 3112506
9 World OWID_WRL 1958 10172920 9255992 20520801 353633 3187292
10 World OWID_WRL 1959 10385711 9908003 23422702 355360 3296620
# ℹ 51 more rows
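To see the before and after side by side, here is a small sketch on a toy data frame. The name toy_catch is made up for illustration. Its values are copied from the first two rows of the output above, and only three of the original columns are kept.

```r
library(tidyverse)
library(janitor)

# toy data frame (not the workshop file): two rows copied from the output above
toy_catch <- tibble(
  Year = c(1950, 1951),
  `Artisanal (small-scale commercial)` = c(7526795, 8278304),
  Subsistence = c(2677833, 2704471)
)

# before: names with spaces and parentheses need backticks to be used in code
colnames(toy_catch)

# after: lowercase names joined by underscores, no backticks needed
toy_clean <- clean_names(toy_catch)
colnames(toy_clean)
```

The cleaned names can be typed directly in later steps, which is why clean_names() comes first in the pipeline.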
Creating new columns
Function: mutate() Package: dplyr (in tidyverse)
In this step, we divide catch by 1,000,000 to express it in million tons.
global_catch %>%
  clean_names() %>%
  # same as above

  # new function: mutate
  # create new columns to calculate catch divided by 1000000
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000)
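As a quick check of what mutate() does, the sketch below runs the same division on a toy data frame. toy_catch is a made-up name, and the two values are the 1950 and 1951 artisanal catches from the output earlier in this section.

```r
library(tidyverse)

# toy data frame with an already-cleaned column name
toy_catch <- tibble(
  year = c(1950, 1951),
  artisanal_small_scale_commercial = c(7526795, 8278304)
)

toy_mil <- toy_catch %>%
  # mutate() adds the new column on the right and keeps the existing ones
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial / 1000000)

toy_mil
```

The original column is still there after mutate(). Nothing is removed until you drop columns with select().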
Putting the data frame in long format
Function: pivot_longer() Package: tidyr (in tidyverse)
global_catch %>%
  clean_names() %>%
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>%
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>%
  # same as above, plus select() to keep only the year and the three new columns

  # new function: pivot_longer
  # put the data frame in long format: each row is an observation
  # in this case, each row is a fishery with some catch (in million tons) in a given year
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons)
Compare this with the output from the “Creating new columns” section.
The data frame now has 3 columns instead of 4: year, a column called name (holding the old column names), and a column called value (holding the numbers). Each row is the catch in million tons for a fishery (either artisanal, industrial, or subsistence) in a given year.
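The change in shape is easiest to see on a small example. This sketch uses a toy wide data frame (toy_wide is a made-up name) holding the 1950 and 1951 values from earlier, already divided by 1,000,000.

```r
library(tidyverse)

# toy wide data frame: one row per year, one column per fishery
toy_wide <- tibble(
  year = c(1950, 1951),
  artisanal_mil_tons = c(7.526795, 8.278304),
  industrial_mil_tons = c(14.566338, 15.417937),
  subsistence_mil_tons = c(2.677833, 2.704471)
)

# long format: one row per year-fishery combination
toy_long <- toy_wide %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons)

dim(toy_wide) # 2 rows, 4 columns
dim(toy_long) # 6 rows, 3 columns
toy_long
```

Two years times three fisheries gives six rows. The old column names end up in name and the numbers in value.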
Renaming columns
Function: rename() Package: dplyr (in tidyverse)
global_catch %>%
  clean_names() %>%
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>%
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>%
  # same as above

  # new function: rename
  # renames columns so that they are easier to understand
  # arguments: "new name" = "old name"
  rename(catch_mil = value,
         fishery_type = name)
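As an aside, pivot_longer() can also name the two new columns itself through its names_to and values_to arguments. The sketch below is an alternative to the rename() step, not the approach this section uses, and it runs on a made-up toy data frame to show that both routes give the same column names.

```r
library(tidyverse)

# toy wide data frame (made-up name, values from the 1950 and 1951 rows)
toy_wide <- tibble(
  year = c(1950, 1951),
  artisanal_mil_tons = c(7.526795, 8.278304),
  subsistence_mil_tons = c(2.677833, 2.704471)
)

# route 1 (used in this section): pivot, then rename
two_steps <- toy_wide %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>%
  rename(catch_mil = value,
         fishery_type = name)

# route 2 (alternative): name the columns while pivoting
one_step <- toy_wide %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons,
               names_to = "fishery_type",
               values_to = "catch_mil")

colnames(two_steps)
colnames(one_step)
```

Keeping rename() as its own step makes each change easier to follow, which is probably why it is shown separately here.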
Function: case_when() and mutate() Package: dplyr (in tidyverse)
This creates a new column with the “full name” for each fishery!
global_catch %>%
  clean_names() %>%
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>%
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>%
  rename(catch_mil = value,
         fishery_type = name) %>%
  # same as above

  # new function: mutate with case_when
  # creates new column of full names for fisheries
  mutate(fishery_full_name = case_when(
    # if the fishery_type is artisanal, then name it "Artisanal fishery"
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
    # if the fishery_type is industrial, then name it "Industrial fishery"
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    # if the fishery_type is subsistence, then name it "Subsistence fishery"
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  ))
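One thing to watch for with case_when(): any row that matches none of the conditions gets NA. The sketch below shows this on a toy data frame. The "recreational_mil_tons" value and the "Other fishery" label are made up for illustration and are not part of the workshop data.

```r
library(tidyverse)

# toy data frame: the third value is hypothetical and matches no condition below
toy_long <- tibble(
  fishery_type = c("artisanal_mil_tons", "industrial_mil_tons", "recreational_mil_tons")
)

toy_named <- toy_long %>%
  mutate(
    # no catch-all: unmatched rows become NA
    no_fallback = case_when(
      fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
      fishery_type == "industrial_mil_tons" ~ "Industrial fishery"
    ),
    # with a catch-all: TRUE matches every row not handled above
    with_fallback = case_when(
      fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
      fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
      TRUE ~ "Other fishery"
    )
  )

toy_named
```

Conditions are checked from top to bottom, so a catch-all has to come last.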
Filtering the data frame based on what’s in a column
Function: filter() Package: dplyr (in tidyverse)
This filters the data frame to only include observations after 1980 in the year column.
global_catch %>%
  clean_names() %>%
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>%
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>%
  rename(catch_mil = value,
         fishery_type = name) %>%
  mutate(fishery_full_name = case_when(
    # if the fishery_type is artisanal, then name it "Artisanal fishery"
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
    # if the fishery_type is industrial, then name it "Industrial fishery"
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    # if the fishery_type is subsistence, then name it "Subsistence fishery"
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>%
  # same as above

  # new function: filter
  # filters data frame for observations after 1980
  filter(year > 1980)
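Note that "after 1980" means the year 1980 itself is dropped. A small sketch on a toy data frame (toy_years is a made-up name) shows the difference between > and >=:

```r
library(tidyverse)

# toy data frame with years on either side of the cutoff
toy_years <- tibble(year = c(1979, 1980, 1981, 1982))

# strictly after 1980: keeps 1981 and 1982
after_1980 <- toy_years %>%
  filter(year > 1980)

# 1980 or later: keeps 1980, 1981, and 1982
from_1980 <- toy_years %>%
  filter(year >= 1980)

after_1980$year
from_1980$year
```

If you want 1980 included in the cleaned data, use >= instead of >.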
All together to create a data frame called global_catch_clean
global_catch_clean <- global_catch %>%
  # makes the column names nicer
  clean_names() %>%
  # divides catch by 1000000 to calculate catch in million tons
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>%
  # selecting columns
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>%
  # put the data frame in long format: each row is an observation
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>%
  # renames columns so that they are easier to understand
  rename(catch_mil = value,
         fishery_type = name) %>%
  # creates new column of full names for fisheries
  mutate(fishery_full_name = case_when(
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>%
  # filters data frame for observations after 1980
  filter(year > 1980)
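After building a cleaned data frame, it is worth checking that it looks the way you expect. The sketch below runs the same pipeline on a toy stand-in for global_catch and then does a few sanity checks. toy_raw and toy_clean are made-up names, and the three rows (1968, 1983, 2001) are copied from the sample output at the top of this section.

```r
library(tidyverse)
library(janitor)

# toy stand-in for global_catch: three rows copied from the sample output
toy_raw <- tibble(
  Entity = "World",
  Code = "OWID_WRL",
  Year = c(1968, 1983, 2001),
  `Artisanal (small-scale commercial)` = c(12303383, 15300068, 20046446),
  Discards = c(13871018, 12388473, 11506452),
  `Industrial (large-scale commercial)` = c(55617156, 65689361, 84201719),
  Recreational = c(460158, 849021, 769491),
  Subsistence = c(3916270, 4200135, 3955601)
)

# same steps as the full pipeline above
toy_clean <- toy_raw %>%
  clean_names() %>%
  mutate(artisanal_mil_tons = artisanal_small_scale_commercial/1000000,
         industrial_mil_tons = industrial_large_scale_commercial/1000000,
         subsistence_mil_tons = subsistence/1000000) %>%
  select(year, artisanal_mil_tons, industrial_mil_tons, subsistence_mil_tons) %>%
  pivot_longer(cols = artisanal_mil_tons:subsistence_mil_tons) %>%
  rename(catch_mil = value,
         fishery_type = name) %>%
  mutate(fishery_full_name = case_when(
    fishery_type == "artisanal_mil_tons" ~ "Artisanal fishery",
    fishery_type == "industrial_mil_tons" ~ "Industrial fishery",
    fishery_type == "subsistence_mil_tons" ~ "Subsistence fishery"
  )) %>%
  filter(year > 1980)

# sanity checks
range(toy_clean$year)                   # earliest and latest year kept
count(toy_clean, fishery_full_name)     # rows per fishery
sum(is.na(toy_clean$fishery_full_name)) # rows that case_when() did not label
```

Here 1968 is filtered out, leaving two years for each of the three fisheries. The same three checks can be run on global_catch_clean itself.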