Looking at your data

Data source

In this example, we’re going to use global fishery catch data from Tidy Tuesday 2021-10-12.

Set up

First, we’ll load the packages we need:

library(tidyverse) # general use
library(here) # file path organization
library(skimr) # quick summary stats

Then, we’ll read in the data:

# creating a new object called `global_catch`
global_catch <- read_csv(here("data", # data is in the "data" folder in this repository
                              "global-fishery-catch-by-sector.csv")) # file name

Quick look at the data using `glimpse()` and `str()`

glimpse() and str() both show the type of each column (numeric, character, etc.), along with the column names and the first few values in each column.

glimpse() gives similar information to str(): use either (or both!).

glimpse(global_catch)

Rows: 61
Columns: 8
$ Entity                                <chr> "World", "World", "World", "Worl…
$ Code                                  <chr> "OWID_WRL", "OWID_WRL", "OWID_WR…
$ Year                                  <dbl> 1950, 1951, 1952, 1953, 1954, 19…
$ `Artisanal (small-scale commercial)`  <dbl> 7526795, 8278304, 8272109, 84692…
$ Discards                              <dbl> 5874170, 6278225, 7230311, 71729…
$ `Industrial (large-scale commercial)` <dbl> 14566338, 15417937, 16463942, 17…
$ Recreational                          <dbl> 268260, 284319, 293558, 292070, …
$ Subsistence                           <dbl> 2677833, 2704471, 2728141, 27530…

str(global_catch)

spc_tbl_ [61 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Entity                             : chr [1:61] "World" "World" "World" "World" ...
 $ Code                               : chr [1:61] "OWID_WRL" "OWID_WRL" "OWID_WRL" "OWID_WRL" ...
 $ Year                               : num [1:61] 1950 1951 1952 1953 1954 ...
 $ Artisanal (small-scale commercial) : num [1:61] 7526795 8278304 8272109 8469284 9226926 ...
 $ Discards                           : num [1:61] 5874170 6278225 7230311 7172937 8012930 ...
 $ Industrial (large-scale commercial): num [1:61] 14566338 15417937 16463942 17163789 18340199 ...
 $ Recreational                       : num [1:61] 268260 284319 293558 292070 304398 ...
 $ Subsistence                        : num [1:61] 2677833 2704471 2728141 2753098 2895153 ...
 - attr(*, "spec")=
  .. cols(
  ..   Entity = col_character(),
  ..   Code = col_character(),
  ..   Year = col_double(),
  ..   `Artisanal (small-scale commercial)` = col_double(),
  ..   Discards = col_double(),
  ..   `Industrial (large-scale commercial)` = col_double(),
  ..   Recreational = col_double(),
  ..   Subsistence = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Getting the column names using `colnames()`

You’ll be using the column names of your data frame a lot. You can use colnames() to see exactly what they are.

colnames(global_catch)

[1] "Entity"                              "Code"                               
[3] "Year"                                "Artisanal (small-scale commercial)" 
[5] "Discards"                            "Industrial (large-scale commercial)"
[7] "Recreational"                        "Subsistence"
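
Several of these names contain spaces and parentheses, so they need to be wrapped in backticks when you refer to them in code. Here is a minimal sketch of that, using a two-row stand-in for `global_catch` built from the values printed above. The object names `catch_small`, `artisanal_only`, and `catch_renamed`, and the new column name `artisanal`, are made up for illustration.

```r
library(dplyr)

# a two-row stand-in for `global_catch` (values copied from the output above)
catch_small <- tibble(
  Year = c(1950, 1951),
  `Artisanal (small-scale commercial)` = c(7526795, 8278304),
  Discards = c(5874170, 6278225)
)

# names with spaces or parentheses must be wrapped in backticks
artisanal_only <- select(catch_small, Year, `Artisanal (small-scale commercial)`)

# optional: rename the column to something easier to type
# (`artisanal` is an illustrative name, not one used in the data)
catch_renamed <- rename(catch_small, artisanal = `Artisanal (small-scale commercial)`)

colnames(catch_renamed)
```

Renaming is a matter of preference: backticks work everywhere, but shorter names can make later code easier to read.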

A random sample of rows using `sample_n()`

You can get a quick look at a random sample of rows from the data frame using sample_n() to get a sense of what each row might look like.

sample_n(global_catch, # the data frame
         size = 10) # the number of rows to view

# A tibble: 10 × 8
   Entity Code      Year Artisanal (small-scal…¹ Discards Industrial (large-sc…²
   <chr>  <chr>    <dbl>                   <dbl>    <dbl>                  <dbl>
 1 World  OWID_WRL  1986                16227297 15091641               78606140
 2 World  OWID_WRL  1972                12753562 12357980               51789011
 3 World  OWID_WRL  1969                12211407 12761884               52626777
 4 World  OWID_WRL  1987                17449809 15669889               79755589
 5 World  OWID_WRL  1992                18417117 15717869               81793283
 6 World  OWID_WRL  1999                20588769 12426124               84833508
 7 World  OWID_WRL  1980                14688464 10500231               60489781
 8 World  OWID_WRL  1976                13761132 11095856               59084000
 9 World  OWID_WRL  1975                13432664 11313080               55854975
10 World  OWID_WRL  1998                20332081 13154260               79574228
# ℹ abbreviated names: ¹`Artisanal (small-scale commercial)`,
#   ²`Industrial (large-scale commercial)`
# ℹ 2 more variables: Recreational <dbl>, Subsistence <dbl>
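
In recent versions of dplyr, sample_n() is marked as superseded in favor of slice_sample(). It still works, but you may see slice_sample() in newer code. The sketch below shows the equivalent call on a stand-in data frame with one row per year from 1950 to 2010. The names `catch_years` and `random_rows` are made up for illustration, and the seed is optional.

```r
library(dplyr)

# stand-in for `global_catch`: one row per year, 1950 to 2010 (61 rows)
catch_years <- tibble(Year = 1950:2010)

# optional: setting a seed makes the random sample reproducible
set.seed(1)

# draw 10 rows at random, without replacement
random_rows <- slice_sample(catch_years, # the data frame
                            n = 10) # the number of rows to view

nrow(random_rows)
```

Without set.seed(), you will get a different set of rows each time you run the code, which is usually fine for a quick look.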

Summary stats and more with `skim()`

The skim() function in the skimr package summarizes the data in each column. The output is a lot of information, but if you want a broad-strokes summary (e.g. what’s the mean in each column?), then skim() might be a good option!

skim(global_catch)

Data summary
Name	global_catch
Number of rows	61
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Entity	0	1	5	5	0	1	0
Code	0	1	8	8	0	1	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	1	1980.0	17.75	1950	1965	1980	1995	2010	▇▇▇▇▇
Artisanal (small-scale commercial)	1	15180177.2	4255916.20	7526795	11653978	14688464	19665232	21828623	▅▇▆▅▇
Discards	1	11834829.3	2601005.59	5874170	10014530	11712961	13595692	16962727	▂▃▇▆▃
Industrial (large-scale commercial)	1	59491189.3	24241541.41	14566338	41475194	62459956	80593708	90068159	▃▂▃▃▇
Recreational	1	609926.8	194120.54	268260	405857	731511	769491	849021	▃▂▁▁▇
Subsistence	1	3777210.0	416715.61	2677833	3630156	3911536	4072844	4226487	▂▂▁▆▇
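
If you only need one statistic rather than the full skim() output, one alternative (not part of skimr) is summarize() with across() from dplyr. This is a rough sketch using the first five rows of two columns from the str() output above. The names `catch_small` and `column_means` are made up for illustration.

```r
library(dplyr)

# first five rows of two columns, copied from the str() output above
catch_small <- tibble(
  Year = 1950:1954,
  Discards = c(5874170, 6278225, 7230311, 7172937, 8012930),
  Recreational = c(268260, 284319, 293558, 292070, 304398)
)

# calculate the mean of each selected column
column_means <- summarize(catch_small,
                          across(c(Discards, Recreational), mean))

column_means
```

The result is a one-row data frame with one column per variable, which can be easier to work with than printed skim() output if you want to use the means in later calculations.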
