Looking at your data
Data source
In this example, we’re going to use global fishery catch data from Tidy Tuesday 2021-10-12.
Set up
First, we’ll load the packages we need:
library(tidyverse) # general use
library(here) # file path organization
library(skimr) # quick summary stats
Then, we’ll read in the data:
# creating a new object called `global_catch`
global_catch <- read_csv(here("data", # data is in the "data" folder in this repository
                              "global-fishery-catch-by-sector.csv")) # file name
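If read_csv() complains that it cannot find the file, it can help to look at the path that here() builds before reading anything. The sketch below assumes the same folder layout as above; `catch_path` is a name made up for this illustration.

```r
library(here)

# build the path once and store it (`catch_path` is a hypothetical name)
catch_path <- here("data", "global-fishery-catch-by-sector.csv")

# print the full path, which starts at the project root
catch_path

# TRUE if the file is where we expect it, FALSE otherwise
file.exists(catch_path)
```

If file.exists() returns FALSE, the usual suspects are a typo in the file name or a project root that is not where you think it is.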
Quick look at the data using glimpse() and str()
glimpse() and str() both give a sense of the kinds of variables (numeric, factor, etc.) that are in each column, along with the column names and the first few values in each column. glimpse() gives similar information to str(): use either (or both!).
glimpse(global_catch)
Rows: 61
Columns: 8
$ Entity <chr> "World", "World", "World", "Worl…
$ Code <chr> "OWID_WRL", "OWID_WRL", "OWID_WR…
$ Year <dbl> 1950, 1951, 1952, 1953, 1954, 19…
$ `Artisanal (small-scale commercial)` <dbl> 7526795, 8278304, 8272109, 84692…
$ Discards <dbl> 5874170, 6278225, 7230311, 71729…
$ `Industrial (large-scale commercial)` <dbl> 14566338, 15417937, 16463942, 17…
$ Recreational <dbl> 268260, 284319, 293558, 292070, …
$ Subsistence <dbl> 2677833, 2704471, 2728141, 27530…
str(global_catch)
spc_tbl_ [61 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Entity : chr [1:61] "World" "World" "World" "World" ...
$ Code : chr [1:61] "OWID_WRL" "OWID_WRL" "OWID_WRL" "OWID_WRL" ...
$ Year : num [1:61] 1950 1951 1952 1953 1954 ...
$ Artisanal (small-scale commercial) : num [1:61] 7526795 8278304 8272109 8469284 9226926 ...
$ Discards : num [1:61] 5874170 6278225 7230311 7172937 8012930 ...
$ Industrial (large-scale commercial): num [1:61] 14566338 15417937 16463942 17163789 18340199 ...
$ Recreational : num [1:61] 268260 284319 293558 292070 304398 ...
$ Subsistence : num [1:61] 2677833 2704471 2728141 2753098 2895153 ...
- attr(*, "spec")=
.. cols(
.. Entity = col_character(),
.. Code = col_character(),
.. Year = col_double(),
.. `Artisanal (small-scale commercial)` = col_double(),
.. Discards = col_double(),
.. `Industrial (large-scale commercial)` = col_double(),
.. Recreational = col_double(),
.. Subsistence = col_double()
.. )
- attr(*, "problems")=<externalptr>
Getting the column names using colnames()
You’ll be using the column names of your data frame a lot, so you can use colnames() to figure out what the names actually are.
colnames(global_catch)
[1] "Entity" "Code"
[3] "Year" "Artisanal (small-scale commercial)"
[5] "Discards" "Industrial (large-scale commercial)"
[7] "Recreational" "Subsistence"
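Some of these names contain spaces and parentheses, so they need backticks whenever you refer to them in code. Here is a minimal sketch using a toy data frame (`toy_catch`, a name invented for this example) built from the first three years of values printed above; renaming the column is one possible way to avoid typing the backticks repeatedly.

```r
library(tidyverse)

# toy data frame: first three years of values from the output above
# (`toy_catch` is a hypothetical name used only for illustration)
toy_catch <- tibble(
  Year = c(1950, 1951, 1952),
  `Artisanal (small-scale commercial)` = c(7526795, 8278304, 8272109),
  Discards = c(5874170, 6278225, 7230311)
)

# names with spaces or parentheses have to be wrapped in backticks
toy_catch$`Artisanal (small-scale commercial)`

# rename() takes new_name = old_name
toy_renamed <- rename(toy_catch,
                      artisanal = `Artisanal (small-scale commercial)`)

colnames(toy_renamed)
```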
A random sample of rows using sample_n()
You can get a quick look at a random sample of rows from the data frame using sample_n() to get a sense of what each row might look like.
sample_n(global_catch, # the data frame
size = 10) # the number of rows to view
# A tibble: 10 × 8
Entity Code Year Artisanal (small-scal…¹ Discards Industrial (large-sc…²
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 World OWID_WRL 1986 16227297 15091641 78606140
2 World OWID_WRL 1972 12753562 12357980 51789011
3 World OWID_WRL 1969 12211407 12761884 52626777
4 World OWID_WRL 1987 17449809 15669889 79755589
5 World OWID_WRL 1992 18417117 15717869 81793283
6 World OWID_WRL 1999 20588769 12426124 84833508
7 World OWID_WRL 1980 14688464 10500231 60489781
8 World OWID_WRL 1976 13761132 11095856 59084000
9 World OWID_WRL 1975 13432664 11313080 55854975
10 World OWID_WRL 1998 20332081 13154260 79574228
# ℹ abbreviated names: ¹`Artisanal (small-scale commercial)`,
# ²`Industrial (large-scale commercial)`
# ℹ 2 more variables: Recreational <dbl>, Subsistence <dbl>
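Because the rows are drawn at random, you will see different rows each time you run the code. If you want the same sample every time, one option is to call set.seed() first. The sketch below uses a toy data frame (`toy_catch`, a made-up name) and slice_sample(), which the dplyr documentation lists as the newer replacement for sample_n(); the seed value 1 is arbitrary.

```r
library(tidyverse)

# toy data frame with one row per year (hypothetical, for illustration only)
toy_catch <- tibble(Year = 1950:1959)

# setting the same seed before each draw gives the same rows
set.seed(1)
first_draw <- slice_sample(toy_catch, n = 3)

set.seed(1)
second_draw <- slice_sample(toy_catch, n = 3)

# check that both draws picked identical rows
identical(first_draw, second_draw)
```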
Summary stats and more with skim()
The skim() function in the skimr package gives you a bunch of summary information about the data in each column. Note that this is a lot of information, but if you want a broad-strokes summary (e.g. what’s the mean in each column?), then skim() might be a good option!
skim(global_catch)
Name | global_catch |
Number of rows | 61 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Entity | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
Code | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 1980.0 | 17.75 | 1950 | 1965 | 1980 | 1995 | 2010 | ▇▇▇▇▇ |
Artisanal (small-scale commercial) | 0 | 1 | 15180177.2 | 4255916.20 | 7526795 | 11653978 | 14688464 | 19665232 | 21828623 | ▅▇▆▅▇ |
Discards | 0 | 1 | 11834829.3 | 2601005.59 | 5874170 | 10014530 | 11712961 | 13595692 | 16962727 | ▂▃▇▆▃ |
Industrial (large-scale commercial) | 0 | 1 | 59491189.3 | 24241541.41 | 14566338 | 41475194 | 62459956 | 80593708 | 90068159 | ▃▂▃▃▇ |
Recreational | 0 | 1 | 609926.8 | 194120.54 | 268260 | 405857 | 731511 | 769491 | 849021 | ▃▂▁▁▇ |
Subsistence | 0 | 1 | 3777210.0 | 416715.61 | 2677833 | 3630156 | 3911536 | 4072844 | 4226487 | ▂▂▁▆▇ |
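If you only need one of these statistics, you may not need the full skim() output. As a rough sketch, summarise() with across() can calculate just the column means. The toy data frame below (`toy_catch`, a name invented here) only holds the first five years of two columns as printed by str() above, so its means will not match the full-data values in the skim() table.

```r
library(tidyverse)

# toy data frame: 1950-1954 values copied from the str() output above
# (`toy_catch` and `toy_means` are hypothetical names)
toy_catch <- tibble(
  Year = 1950:1954,
  Recreational = c(268260, 284319, 293558, 292070, 304398),
  Subsistence = c(2677833, 2704471, 2728141, 2753098, 2895153)
)

# calculate the mean of each selected column
toy_means <- toy_catch %>%
  summarise(across(c(Recreational, Subsistence), mean))

toy_means
```

Swapping mean for another function (for example sd or median) gives the corresponding statistic for each column.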