SBC LTER Data Workshop
Part 1. Downloading and visually exploring datasets
Learning objectives
By the end of this section of the workshop, you will be able to:
1. use R packages to download LTER data to your own computer
2. visualize data as a first step to exploring LTER datasets
Dataset
In this section of the workshop, we’ll work with the seasonal timeseries data. The full citation is: Reed, D. and R. Miller. 2024. SBC LTER: Reef: Seasonal Kelp Forest Community Dynamics: biomass of kelp forest species, ongoing since 2008 ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/2e3b1cf934ec4f4a9293ba117aad37f5.
At the end of the workshop, we’ll have a visualization of purple and red urchin biomass through time at Naples Reef from 2014-2024.
1. set up
The tidyverse package is a general use package for cleaning and visualizing data. See https://www.tidyverse.org/ for details on what packages are in the tidyverse!
The janitor package contains a bunch of different functions to help you clean up your data. See the website for more details: https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html
library(tidyverse)
library(janitor)
2. getting data from EDI
In this section, we’ll download the data from EDI and save it as an object called dt1. This follows the code that is provided with every dataset on EDI; you can find it on the dataset page under Code Generation.
# getting the dataset URL and saving it as an object called `inUrl1`
inUrl1 <- "https://pasta.lternet.edu/package/data/eml/knb-lter-sbc/182/1/8bd628cbfbb5bef0d9b1a7ead8aab832"

# creating an object called `infile1` that is a temporary file
infile1 <- tempfile()

# download the dataset specified in `inUrl1` and store it in `infile1`
try(download.file(inUrl1, infile1, method = "curl",
                  extra = paste0(' -A "', getOption("HTTPUserAgent"), '"')))
if (is.na(file.size(infile1))) download.file(inUrl1, infile1, method = "auto")

# read in the dataset using the function `read.csv()`
# store the dataset as an object called `dt1`
dt1 <- read.csv(infile1,
                header = FALSE,
                skip = 1,
                sep = ",",
                quote = '"',
                col.names = c("YEAR", "MONTH", "DATE", "SITE", "TRANSECT",
                              "VIS", "SP_CODE", "PERCENT_COVER", "DENSITY",
                              "WM_GM2", "DRY_GM2", "SFDM", "AFDM",
                              "SCIENTIFIC_NAME", "COMMON_NAME",
                              "TAXON_KINGDOM", "TAXON_CLASS", "TAXON_PHYLUM",
                              "TAXON_ORDER", "TAXON_FAMILY", "TAXON_GENUS",
                              "GROUP", "MOBILITY", "GROWTH_MORPH"),
                check.names = TRUE)

# unlink the temporary file
unlink(infile1)
As of this workshop (20 February 2025), this code to load in the data still works. However, EDI will soon switch to an API-based system, which will require all users to log in and authenticate themselves before accessing the data. Look out for changes to EDI in the future!
3. getting a first look at the data
The first thing to do when working with data is to look at it. This is an underappreciated but crucial step! By looking at the data, you’ll have a sense of what the columns and rows are so you can then build your intuition for how to work with it down the line.
One way to quickly look at the data is to use the glimpse() function, which comes from the dplyr package within the tidyverse.
glimpse(dt1)
Rows: 84,316
Columns: 24
$ YEAR <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, …
$ MONTH <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ DATE <chr> "2008-01-10", "2008-01-10", "2008-01-10", "2008-01-10"…
$ SITE <chr> "NAPL", "NAPL", "NAPL", "NAPL", "NAPL", "NAPL", "NAPL"…
$ TRANSECT <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ VIS <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ SP_CODE <chr> "AB", "AHOL", "AL", "AML", "AMZO", "ANDA", "ANSP", "AP…
$ PERCENT_COVER <dbl> 0.00, -99999.00, 1.25, -99999.00, 0.00, -99999.00, -99…
$ DENSITY <dbl> -99999.00, 0.00, -99999.00, 4.55, -99999.00, 0.00, 3.5…
$ WM_GM2 <dbl> 0.0000000, 0.0000000, 0.0597675, 233.3452338, 0.000000…
$ DRY_GM2 <dbl> 0.00000000, 0.00000000, 0.03687655, 71.40364154, 0.000…
$ SFDM <dbl> 0.000000e+00, -9.999900e+04, 2.121746e-02, 4.876915e+0…
$ AFDM <dbl> 0.000000000, 0.000000000, 0.009503032, 34.301749368, 0…
$ SCIENTIFIC_NAME <chr> "Abietinaria spp.", "Alloclinus holderi", "Astrangia h…
$ COMMON_NAME <chr> "Coarse Sea Fir Hydroid", "Island Kelpfish", "Aggregat…
$ TAXON_KINGDOM <chr> "Animalia", "Animalia", "Animalia", "Animalia", "Plant…
$ TAXON_CLASS <chr> "Cnidaria", "Chordata", "Cnidaria", "Echinodermata", "…
$ TAXON_PHYLUM <chr> "Hydrozoa", "Actinopterygii", "Anthozoa", "Asteroidea"…
$ TAXON_ORDER <chr> "Leptothecata", "Perciformes", "Scleractinia", "Valvat…
$ TAXON_FAMILY <chr> "Sertulariidae", "Labrisomidae", "Rhizangiidae", "Aste…
$ TAXON_GENUS <chr> "Abietinaria", "Alloclinus", "Astrangia", "Patiria", "…
$ GROUP <chr> "INVERT", "FISH", "INVERT", "INVERT", "ALGAE", "FISH",…
$ MOBILITY <chr> "SESSILE", "MOBILE", "SESSILE", "MOBILE", "SESSILE", "…
$ GROWTH_MORPH <chr> "AGGREGATED", "SOLITARY", "AGGREGATED", "SOLITARY", "A…
From this, you know there are 84,316 rows and 24 columns in the dataset. This function also gives you information about what is in each column; for example, there is a column called YEAR that contains integers (<int>), and you can see the years themselves (e.g. 2008).
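If you prefer base R, a few other functions give similar quick summaries (shown here as optional alternatives, assuming dt1 has been loaded as above):

```r
dim(dt1)           # number of rows and columns
names(dt1)         # all 24 column names
head(dt1, 3)       # the first 3 rows
summary(dt1$YEAR)  # minimum, maximum, and quartiles of the YEAR column
```
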
You can also look directly at the object by typing View(dt1) in your console, or by clicking on dt1 in the Environment tab.
View(dt1)
Given that we’re interested in the biomass of red and purple urchins at Naples through time from 2014-2024, which columns would be most relevant to us?
Answer: Probably YEAR, DATE, SITE, DRY_GM2, and COMMON_NAME or SCIENTIFIC_NAME (could be others too!)
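If you’re not sure exactly how the urchins are named in the dataset, one way to check (a quick sketch, assuming dt1 has been loaded as above) is to search the COMMON_NAME column for matches before filtering:

```r
# find all common names containing the word "urchin", ignoring case
unique(dt1$COMMON_NAME[grepl("urchin", dt1$COMMON_NAME, ignore.case = TRUE)])
```
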
4. cleaning the data
In this section, we’ll filter the dataset and clean it up a bit so that it’s easier to use. Remember that we’re only interested in purple and red urchins at Naples, but this dataset includes biomass for all species covered in the surveys at all the sites. We’ll need to filter the dataset to include only the species and site of interest, but we’ll also do a couple of cleaning steps to make the dataset easier to use.
# we'll create a new object called `urchins`, and start first with `dt1`
urchins <- dt1 |>
  # using this function from `janitor` to make the column names lower case
  clean_names() |>
  # select columns of interest
  select(year, date, site, dry_gm2, common_name) |>
  # make sure the date column is read in as a date
  mutate(date = as.Date(date)) |>
  # make site and common name lower case
  mutate_at(c("site", "common_name"), str_to_lower) |>
  # filter to only include observations between 2014 and 2024
  filter(between(year, 2014, 2024)) |>
  # filter to only include red and purple urchins
  filter(common_name %in% c("red urchin", "purple urchin")) |>
  # filter to only include Naples
  filter(site == "napl")
By now, you know that you should look at your data before you start working with it. But for these long chains of functions piped into each other, it’s a good idea to look at the data frame you’re modifying after each function to make sure that it still looks the way you’d expect. That way, if something goes wrong, you’ll know what step to fix!
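One way to do that check is to run the chain only up to the step you want to inspect, and pipe the result into glimpse(). For example (assuming dt1 has been loaded as above), to check the data frame right after selecting columns:

```r
dt1 |>
  clean_names() |>
  select(year, date, site, dry_gm2, common_name) |>
  glimpse()
```

Once that step looks right, you can add the next function to the chain and check again.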
5. visualizing the data
Now the most exciting part: visualizing the data! Remember that we want to create a timeseries plot of urchin biomass. That means we’ll have time (in this case, the column date) on the x-axis, and biomass (the column dry_gm2) on the y-axis.
To visualize the data, we’ll use the function ggplot() and its associated functions from the ggplot2 package in the tidyverse. See https://ggplot2-book.org/index.html for a user manual to ggplot2. But don’t stop there! Lots of people think deeply about data visualization. I (An) love thinking about visualizing data!! So if you ever want to chat about making visualizations, I am super happy to talk!
Each ggplot visualization has the same 3 basic parts:
1. the “global call”: this tells R that you want to use ggplot by calling the function ggplot()
2. the “aesthetics”: this outlines what you want the axes, colors, etc. to be by calling the function aes() within the global ggplot() call
3. the “geometries”: these define the shapes (e.g. points, lines), colors, etc. that you want to draw on your plot. These functions all start with geom_.
To make your visualization more beautiful, you can adjust things like:
a. colors (using scale_color_manual() or other functions),
b. the overall “look” (using theme_ functions), and/or
c. labels (using labs())
but there are many more options than these!
# 1. global call
ggplot(data = urchins, # specifying data
# 2. aesthetics
aes(x = date, # x-axis
y = dry_gm2, # y-axis
color = common_name, # what we want the colors to be
shape = common_name)) +
# 3. geometries
# making points
geom_point(size = 2) +
# making lines to connect the points
geom_line() +
# extra stuff!
# a. changing the colors
scale_color_manual(values = c("purple urchin" = "darkorchid4",
"red urchin" = "firebrick2")) +
# b. changing the theme
theme_minimal() +
# c. changing the labels (note that the x, y, color, and shape arguments
# are the same as the aes() call above)
labs(x = "Date",
y = "Dry biomass (g/m2)",
color = "Species",
shape = "Species",
title = "Timeseries of purple and red urchin biomass, 2014-2024")
This concludes part 1 of the workshop! If you have any questions or want to nerd out about data viz, please feel free to contact me (An Bui, an_bui@ucsb.edu).