Looking at your data

Data source

In this example, we’re going to use global fishery catch data from Tidy Tuesday 2021-10-12.

Set up

First, we’ll load the packages we need:

library(tidyverse) # general use
library(here) # file path organization
library(skimr) # quick summary stats 

Then, we’ll read in the data:

# creating a new object called `global_catch`
global_catch <- read_csv(here("data", # data is in the "data" folder in this repository
                              "global-fishery-catch-by-sector.csv")) # file name

Quick look at the data using glimpse() and str()

glimpse() and str() both show the type of each column (numeric, character, factor, etc.), along with the column names and the first few values in each column.

glimpse() gives similar information to str(): use either (or both!).

glimpse(global_catch)
Rows: 61
Columns: 8
$ Entity                                <chr> "World", "World", "World", "Worl…
$ Code                                  <chr> "OWID_WRL", "OWID_WRL", "OWID_WR…
$ Year                                  <dbl> 1950, 1951, 1952, 1953, 1954, 19…
$ `Artisanal (small-scale commercial)`  <dbl> 7526795, 8278304, 8272109, 84692…
$ Discards                              <dbl> 5874170, 6278225, 7230311, 71729…
$ `Industrial (large-scale commercial)` <dbl> 14566338, 15417937, 16463942, 17…
$ Recreational                          <dbl> 268260, 284319, 293558, 292070, …
$ Subsistence                           <dbl> 2677833, 2704471, 2728141, 27530…
str(global_catch)
spc_tbl_ [61 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Entity                             : chr [1:61] "World" "World" "World" "World" ...
 $ Code                               : chr [1:61] "OWID_WRL" "OWID_WRL" "OWID_WRL" "OWID_WRL" ...
 $ Year                               : num [1:61] 1950 1951 1952 1953 1954 ...
 $ Artisanal (small-scale commercial) : num [1:61] 7526795 8278304 8272109 8469284 9226926 ...
 $ Discards                           : num [1:61] 5874170 6278225 7230311 7172937 8012930 ...
 $ Industrial (large-scale commercial): num [1:61] 14566338 15417937 16463942 17163789 18340199 ...
 $ Recreational                       : num [1:61] 268260 284319 293558 292070 304398 ...
 $ Subsistence                        : num [1:61] 2677833 2704471 2728141 2753098 2895153 ...
 - attr(*, "spec")=
  .. cols(
  ..   Entity = col_character(),
  ..   Code = col_character(),
  ..   Year = col_double(),
  ..   `Artisanal (small-scale commercial)` = col_double(),
  ..   Discards = col_double(),
  ..   `Industrial (large-scale commercial)` = col_double(),
  ..   Recreational = col_double(),
  ..   Subsistence = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
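If you only want the column types and the dimensions, without the previewed values, a more compact check is possible. The sketch below is one way to do that; `toy_catch` is a hypothetical stand-in for `global_catch`, built from the first few values shown in the output above so the example runs without the CSV file.

```r
library(dplyr)

# Hypothetical toy data frame standing in for `global_catch`
# (values copied from the first rows of the output above)
toy_catch <- tibble(
  Entity = c("World", "World", "World"),
  Year = c(1950, 1951, 1952),
  Discards = c(5874170, 6278225, 7230311)
)

# class of each column, returned as a named character vector
col_types <- sapply(toy_catch, class)
col_types

# number of rows, then number of columns
dim(toy_catch)
```

This gives the same type information that glimpse() and str() report, just without the preview of the contents.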

Getting the column names using colnames()

You’ll be using the column names of your data frame a lot, so use colnames() to check exactly what they are.

colnames(global_catch)
[1] "Entity"                              "Code"                               
[3] "Year"                                "Artisanal (small-scale commercial)" 
[5] "Discards"                            "Industrial (large-scale commercial)"
[7] "Recreational"                        "Subsistence"                        
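Some of these names contain spaces and parentheses, so they have to be wrapped in backticks when you refer to them in code. The sketch below shows this on a hypothetical `toy_catch` data frame (a stand-in for `global_catch` using the first two rows shown earlier); the shorter names `artisanal` and `industrial` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frame with the same awkward column names
toy_catch <- tibble(
  Year = c(1950, 1951),
  `Artisanal (small-scale commercial)` = c(7526795, 8278304),
  `Industrial (large-scale commercial)` = c(14566338, 15417937)
)

# names with spaces or parentheses need backticks
toy_catch %>%
  select(Year, `Artisanal (small-scale commercial)`)

# optional: rename to shorter names (new_name = old_name)
toy_renamed <- toy_catch %>%
  rename(
    artisanal = `Artisanal (small-scale commercial)`,
    industrial = `Industrial (large-scale commercial)`
  )

colnames(toy_renamed)
```

Renaming is a matter of taste: the original names are more descriptive, while the short ones are easier to type.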

A random sample of rows using sample_n()

sample_n() returns a random sample of rows from the data frame, which gives you a sense of what a typical row looks like. Because the rows are drawn at random, you’ll see different rows each time you run it.

sample_n(global_catch, # the data frame
         size = 10) # the number of rows to view
# A tibble: 10 × 8
   Entity Code      Year Artisanal (small-scal…¹ Discards Industrial (large-sc…²
   <chr>  <chr>    <dbl>                   <dbl>    <dbl>                  <dbl>
 1 World  OWID_WRL  1986                16227297 15091641               78606140
 2 World  OWID_WRL  1972                12753562 12357980               51789011
 3 World  OWID_WRL  1969                12211407 12761884               52626777
 4 World  OWID_WRL  1987                17449809 15669889               79755589
 5 World  OWID_WRL  1992                18417117 15717869               81793283
 6 World  OWID_WRL  1999                20588769 12426124               84833508
 7 World  OWID_WRL  1980                14688464 10500231               60489781
 8 World  OWID_WRL  1976                13761132 11095856               59084000
 9 World  OWID_WRL  1975                13432664 11313080               55854975
10 World  OWID_WRL  1998                20332081 13154260               79574228
# ℹ abbreviated names: ¹​`Artisanal (small-scale commercial)`,
#   ²​`Industrial (large-scale commercial)`
# ℹ 2 more variables: Recreational <dbl>, Subsistence <dbl>
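In current versions of dplyr, sample_n() is superseded by slice_sample(), which does the same job. The sketch below uses slice_sample() on a hypothetical `toy_catch` data frame (a stand-in for `global_catch` built from the first five rows shown earlier) and sets a seed so the same rows come back on every run; the seed value is arbitrary.

```r
library(dplyr)

# Hypothetical toy data frame standing in for `global_catch`
toy_catch <- tibble(
  Year = 1950:1954,
  Recreational = c(268260, 284319, 293558, 292070, 304398)
)

# fix the random number generator so the sample is reproducible
set.seed(123)

# draw 3 rows at random (without replacement by default)
toy_sample <- slice_sample(toy_catch, # the data frame
                           n = 3) # the number of rows to view
toy_sample
```

Without the set.seed() call, each run draws a different set of rows.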

Summary stats and more with skim()

The skim() function in the skimr package summarizes every column: the number of missing values and, for numeric columns, the mean, standard deviation, quantiles, and a small histogram. That is a lot of information, but if you want a broad overview (e.g. what’s the mean of each column?), then skim() is a good option!

skim(global_catch)
Data summary
Name global_catch
Number of rows 61
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Entity                 0              1    5    5      0         1           0
Code                   0              1    8    8      0         1           0

Variable type: numeric

skim_variable                        n_missing  complete_rate        mean           sd        p0       p25       p50       p75      p100  hist
Year                                         0              1      1980.0        17.75      1950      1965      1980      1995      2010  ▇▇▇▇▇
Artisanal (small-scale commercial)           0              1  15180177.2   4255916.20   7526795  11653978  14688464  19665232  21828623  ▅▇▆▅▇
Discards                                     0              1  11834829.3   2601005.59   5874170  10014530  11712961  13595692  16962727  ▂▃▇▆▃
Industrial (large-scale commercial)          0              1  59491189.3  24241541.41  14566338  41475194  62459956  80593708  90068159  ▃▂▃▃▇
Recreational                                 0              1    609926.8    194120.54    268260    405857    731511    769491    849021  ▃▂▁▁▇
Subsistence                                  0              1   3777210.0    416715.61   2677833   3630156   3911536   4072844   4226487  ▂▂▁▆▇
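If you only need one statistic, such as the mean, you can also compute it directly instead of reading it off the skim() output. The sketch below is one way to do that with summarise() and across() from dplyr; `toy_catch` is a hypothetical stand-in for `global_catch` that holds only the first five rows shown earlier, so its means will not match the full-data means in the table above.

```r
library(dplyr)

# Hypothetical toy data frame: first five rows of two columns
toy_catch <- tibble(
  Year = 1950:1954,
  Discards = c(5874170, 6278225, 7230311, 7172937, 8012930),
  Recreational = c(268260, 284319, 293558, 292070, 304398)
)

# mean of each chosen column, returned as a one-row tibble
col_means <- toy_catch %>%
  summarise(across(c(Discards, Recreational), mean))

col_means
```

Swapping `mean` for another function (for example `sd` or `median`) gives the corresponding statistic for the same columns.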