Introduction to palaeoverse

A community-driven R package
By Lewis A. Jones

North American Paleontological Convention 2024

Development team

Lewis A. Jones, Universidade de Vigo
William Gearty, American Museum of Natural History
Bethany J. Allen, ETH Zürich
Kilian Eichenseer, University of Durham
Christopher D. Dean, University College London
Sofía Galván, Universidade de Vigo
Miranta Kouvari, University College London

Pedro L. Godoy, University of São Paulo
Cecily Nicholl, University College London
Lucas Buffan, École Normale Supérieure de Lyon
Erin M. Dillon, Smithsonian Tropical Research Institute
Alfio A. Chiarenza, Universidade de Vigo

Introduction

The long and the short of it 📏…

What is Palaeoverse?

Palaeoverse is a project that aims to bring the palaeobiology community together.

What is the palaeoverse R package?

palaeoverse provides auxiliary functions to support data preparation and exploration.

Improve code readability, reusability and reproducibility.

What makes palaeoverse different?

Community-informed development
- Authors (n = 13)
- Survey participants (n = 35)

Well-documented & peer-reviewed code
- Formal review process

A community-driven package
- http://palaeoverse.palaeoverse.org

Functionality

A whistle-stop tour of palaeoverse 🚋…

What’s available?

axis_geo
bin_lat
bin_time
data
group_apply
lat_bins
look_up
palaeorotate
phylo_check

tax_check
tax_expand_lat
tax_expand_time
tax_range_space
tax_range_time
tax_range_strat
tax_unique
time_bins

Expected input

A lot of data, a lot of sources, and a lot of unique features.

Data structure, not source.

occdf → function(x) → df

Occurrence dataframe*
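
As a rough illustration of "data structure, not source", the sketch below builds an occurrence dataframe by hand in base R. The column names follow the tetrapods example dataset; the values are made up, and has_columns is a hypothetical helper for this sketch, not a palaeoverse function.

```r
# Made-up occurrence records assembled by hand (not from any database)
occdf <- data.frame(
  name   = c("A", "B", "C"),
  lng    = c(2, -103, -66),
  lat    = c(46, 35, -7),
  max_ma = c(100, 130, 210),
  min_ma = c(90, 120, 200)
)

# Hypothetical helper: are the columns a function relies on present?
has_columns <- function(df, cols) {
  is.data.frame(df) && all(cols %in% colnames(df))
}

has_columns(occdf, c("lng", "lat"))        # coordinates available
has_columns(occdf, c("max_ma", "min_ma"))  # age range available
```

Any dataframe with the right columns can be passed on, whichever database or spreadsheet it came from.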

Getting started

Let’s dive in 🤿…

Installation

palaeoverse can be installed from CRAN using:

install.packages("palaeoverse")

The development version can be installed using devtools:

devtools::install_github("palaeoverse/palaeoverse")

Once installed, load the package in the usual manner:

library(palaeoverse)

Example datasets

Two example occurrence datasets are available.

Carboniferous–Early Triassic tetrapods (n = 5270, Paleobiology Database).

# Get details on dataset
?tetrapods
# Load dataset
data("tetrapods")
# Available variables
colnames(tetrapods)
##  [1] "occurrence_no"     "collection_no"     "identified_name"  
##  [4] "identified_rank"   "accepted_name"     "accepted_rank"    
##  [7] "early_interval"    "late_interval"     "max_ma"           
## [10] "min_ma"            "phylum"            "class"            
## [13] "order"             "family"            "genus"            
## [16] "abund_value"       "abund_unit"        "lng"              
## [19] "lat"               "collection_name"   "cc"               
## [22] "formation"         "stratgroup"        "member"           
## [25] "zone"              "lithology1"        "environment"      
## [28] "pres_mode"         "taxon_environment" "motility"         
## [31] "life_habit"        "diet"

Phanerozoic reef occurrences (n = 4363, PaleoReefs Database).

# Get details on dataset
?reefs
# Load dataset
data("reefs")
# Available variables
colnames(reefs)
##  [1] "r_number"   "name"       "formation"  "system"     "series"    
##  [6] "interval"   "biota_main" "biota_sec"  "lng"        "lat"       
## [11] "country"    "authors"    "title"      "year"

Reference datasets

Two reference datasets are available.

Geological Time Scale 2012 & 2020 (Gradstein et al. 2012; 2020).

# Get details on dataset
?GTS2012
?GTS2020
# Load dataset
data("GTS2012")
data("GTS2020")
# Increase output width
options(width = 120)
# Print first few rows
head(GTS2012, n = 3)
##   interval_number      interval_name  rank max_ma mid_ma min_ma duration_myr  font  colour abbr
## 1               1           Holocene stage 0.0117 0.0059 0.0000       0.0117 black #FDEDEC <NA>
## 2               2  Upper Pleistocene stage 0.1260 0.0688 0.0117       0.1143 black #FFF2D3 <NA>
## 3               3 Middle Pleistocene stage 0.7810 0.4535 0.1260       0.6550 black #FFF2C7 <NA>
head(GTS2020, n = 3)
##   interval_number interval_name  rank max_ma  mid_ma min_ma duration_myr  font  colour abbr
## 1               1    Meghalayan stage 0.0042 0.00210 0.0000       0.0042 black #FDEDEC <NA>
## 2               2 Northgrippian stage 0.0082 0.00620 0.0042       0.0040 black #FDECE4 <NA>
## 3               3  Greenlandian stage 0.0117 0.00995 0.0082       0.0035 black #FEECDB <NA>

Stratigraphic time bins

# Get stage-level time bins
bins <- time_bins(interval = "Phanerozoic", rank = "stage", plot = TRUE)

# Get first few rows
head(bins, n = 3)
##   bin interval_name  rank max_ma mid_ma min_ma duration_myr abbr  colour  font
## 1   1     Fortunian stage    541  535.0    529           12   Fo #99B575 black
## 2   2       Stage 2 stage    529  525.0    521            8   S2 #A6BA80 black
## 3   3       Stage 3 stage    521  517.5    514            7   S3 #A6C583 black

Macrostrat time bins

# Get North American Land Mammal Ages
bins <- time_bins(scale = "North American Land Mammal Ages", plot = TRUE)

# Get first few rows
head(bins, n = 3)
##   bin interval_name                            rank max_ma mid_ma min_ma duration_myr abbr  colour  font
## 1   1       Puercan North American Land Mammal Ages  66.00 65.375  64.75         1.25    P #FDB469 black
## 2   2   Torrejonian North American Land Mammal Ages  64.75 63.500  62.25         2.50   To #FEBA64 black
## 3   3     Tiffanian North American Land Mammal Ages  62.25 59.875  57.50         4.75   Ti #FEBF6A black

Near-equal-length time bins

# Get near-equal-length time bins (~15 Myr) built from stages
bins <- time_bins(interval = "Phanerozoic", rank = "stage", size = 15, plot = TRUE)

# Get first few rows
head(bins, n = 3)
##   bin max_ma mid_ma min_ma duration_myr grouping_rank                 intervals  colour  font
## 1   1    541 535.00  529.0         12.0         stage                 Fortunian #80cdc1 black
## 2   2    529 521.50  514.0         15.0         stage          Stage 3, Stage 2 #80cdc1 black
## 3   3    514 507.25  500.5         13.5         stage Drumian, Wuliuan, Stage 4 #80cdc1 black

Temporal occurrence binning

Five temporal binning methods for age range data:

# Use tetrapod example data
occdf <- tetrapods

# Get stage-level time bins
bins <- time_bins(interval = "Phanerozoic", rank = "stage")

# Assign via midpoint age of fossil occurrence data
ex1 <- bin_time(occdf = occdf, bins = bins, method = "mid")

# Assign to all bins that age range covers
ex2 <- bin_time(occdf = occdf, bins = bins, method = "all")

# Assign via majority overlap based on fossil occurrence age range
ex3 <- bin_time(occdf = occdf, bins = bins, method = "majority")

# Randomly assign to overlapping bins based on fossil occurrence age range
ex4 <- bin_time(occdf = occdf, bins = bins, method = "random", reps = 10)

# Randomly sample point age estimates (e.g. from a uniform distribution) within each occurrence's age range
ex5 <- bin_time(occdf = occdf, bins = bins, method = "point", reps = 10)
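
To make the difference between the methods concrete, here is a minimal base R sketch of midpoint-style and all-bins-style assignment on toy data. It is not the bin_time implementation: the bin boundaries and age ranges are made up, and the handling of ages that fall exactly on a boundary is an assumption of this sketch.

```r
# Toy bins (made-up boundaries, in Ma) and toy occurrence age ranges
bins <- data.frame(bin = 1:3,
                   max_ma = c(30, 20, 10),
                   min_ma = c(20, 10, 0))
occdf <- data.frame(max_ma = c(28, 22, 12),
                    min_ma = c(24, 14, 3))

# "mid"-style: the bin that contains the midpoint of each age range
mid_ma <- (occdf$max_ma + occdf$min_ma) / 2
mid_bin <- sapply(mid_ma, function(m) {
  bins$bin[m <= bins$max_ma & m > bins$min_ma][1]
})

# "all"-style: every bin that the age range overlaps
all_bins <- lapply(seq_len(nrow(occdf)), function(i) {
  bins$bin[occdf$min_ma[i] < bins$max_ma & occdf$max_ma[i] > bins$min_ma]
})
```

The second and third occurrences each straddle a bin boundary, so they receive one bin under the midpoint rule but two under the all-bins rule.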

Latitudinal occurrence binning

Generate and bin latitudinal data:

# Generate latitudinal bins
bins <- lat_bins(size = 10, plot = TRUE)

# Use reef example data
occdf <- reefs
# Bin occurrences
occdf <- bin_lat(occdf = occdf, bins = bins, lat = "lat")
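
For intuition, latitudinal binning can be sketched in base R with cut. This is only an approximation of the idea: the latitudes are made up, and the bin numbering simply follows cut's south-to-north order, which need not match the numbering used by lat_bins.

```r
# Made-up latitudes
lat <- c(-47, 3, 12.5, 88)

# 10-degree bin boundaries from -90 to 90
breaks <- seq(-90, 90, by = 10)

# Assign each latitude to a bin (right-closed intervals, lowest value included)
bin <- cut(lat, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Midpoint latitude of each bin
mid <- (breaks[-1] + breaks[-length(breaks)]) / 2

binned <- data.frame(lat = lat, bin = bin, mid = mid[bin])
```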

Spatial occurrence binning

Generate and bin spatial data:

# Get reef data
occdf <- reefs[1:500, ]
# Bin data using a hexagonal equal-area grid
occdf <- bin_space(occdf = occdf, spacing = 250, return = TRUE)
# Plot world and grid using ggplot2
library(ggplot2)
library(rnaturalearth)
world <- ne_countries(scale = "small", returnclass = "sf")
ggplot() +
  geom_sf(data = world, colour = "black", fill = "lightgrey") + 
  geom_sf(data = occdf$grid, fill = "orange", colour = "black") + 
  theme_void()

Palaeogeographic reconstruction

Palaeorotate fossil occurrences (multiple models available):

# Example with a few occurrences
occdf <- data.frame(lng = c(2, -103, -66),
                    lat = c(46, 35, -7),
                    age = c(88, 125, 200))

# Estimate palaeocoordinates using the GPlates API
ex1 <- palaeorotate(occdf = occdf, method = "point")

# Estimate palaeocoordinates using reconstruction files
ex2 <- palaeorotate(occdf = occdf, method = "grid")

# Estimate palaeocoordinates and uncertainty using reconstruction files
ex3 <- palaeorotate(occdf = occdf, method = "grid", uncertainty = TRUE)

# Increase output width
options(width = 400)
# Get first few rows
head(ex3)
##    lng lat age   rot_model rot_age rot_lng rot_lat    p_lng    p_lat
## 1    2  46  88 MERDITH2021      88    1.80   46.42  13.0134  37.6406
## 2 -103  35 125 MERDITH2021     127 -102.61   34.63 -41.8928  35.0437
## 3  -66  -7 200 MERDITH2021     200  -65.52   -6.95 -22.5209 -16.7714

To study the past distribution of taxa, the palaeogeographic positions of fossil occurrences need to be reconstructed using plate rotation models. The palaeorotate function provides two methods for reconstructing palaeocoordinates. The first method, “point”, uses the GPlates Web Service to estimate palaeocoordinates for point data, with eight different models available. The second method, “grid”, uses pre-generated reconstruction files to estimate palaeocoordinates. This approach is much faster than the point method for large datasets, and also allows easy exploration. For the second method, the user can additionally calculate the palaeogeographic uncertainty across the eight available models, expressed as the palaeolatitudinal range and the maximum great circle distance between the reconstructed points.
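
The two uncertainty measures can be sketched in base R as follows. This is an illustration under stated assumptions rather than what palaeorotate does internally: the palaeocoordinates are made up, gcd_km is a hypothetical helper, and the haversine formula on a sphere of radius 6371 km stands in for whatever distance calculation the package uses.

```r
# Made-up palaeocoordinates for ONE occurrence under three models
# (not output from palaeorotate)
p_lng <- c(10, 14, 12)
p_lat <- c(35, 38, 30)

# Palaeolatitudinal range across models (degrees)
range_p_lat <- max(p_lat) - min(p_lat)

# Hypothetical helper: haversine great circle distance (km)
gcd_km <- function(lng1, lat1, lng2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Maximum pairwise distance between the model estimates
idx <- combn(length(p_lng), 2)
max_dist <- max(gcd_km(p_lng[idx[1, ]], p_lat[idx[1, ]],
                       p_lng[idx[2, ]], p_lat[idx[2, ]]))
```

A large range or distance for an occurrence signals that the models disagree about where it sat at that time.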

Taxonomic spell check

Identify and count potential spelling variations of the same taxon:

# Load occurrence data
data("tetrapods")
# Check taxon names alphabetically
ex1 <- tax_check(taxdf = tetrapods, name = "genus", dis = 0.05, verbose = FALSE)
# Get first few rows
head(ex1)
##   group     greater     lesser count_greater count_lesser
## 1     D Dvinosaurus Dinosaurus            23            2
## 2     V   Varanopus   Varanops             5            3

# Check taxon names by group
ex2 <- tax_check(taxdf = tetrapods, name = "genus", group = "family", dis = 0.05, verbose = FALSE)
# Get first few rows
head(ex2)
## NULL

In this example dataset:

Dinosaurus belongs to the Phthinosuchidae
Dvinosaurus belongs to the Dvinosauridae
Varanops belongs to the Varanopidae
Varanopus belongs to the Captorhinidae
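
The effect of grouping can be reproduced in miniature with base R. Note the swapped-in technique: this sketch uses Levenshtein edit distance from utils::adist as a stand-in, not the dissimilarity measure used by tax_check. The genus and family names are the four listed above.

```r
# Genus-family pairs from the example above
taxdf <- data.frame(
  genus  = c("Dinosaurus", "Dvinosaurus", "Varanops", "Varanopus"),
  family = c("Phthinosuchidae", "Dvinosauridae",
             "Varanopidae", "Captorhinidae")
)

# Edit distance between all pairs of genus names
d <- utils::adist(taxdf$genus)

# Flag pairs that are one edit apart, ignoring family
near <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
flagged <- data.frame(a = taxdf$genus[near[, "row"]],
                      b = taxdf$genus[near[, "col"]])

# Keep only flagged pairs that share a family
same_family <- taxdf$family[near[, "row"]] == taxdf$family[near[, "col"]]
flagged_in_group <- flagged[same_family, ]
```

Without grouping, both pairs are flagged. Within families, nothing is flagged, because each similar-looking name sits in a different family.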

Unique taxa

Identify unique taxa:

# Create dataframe
occdf <- data.frame(species = c("rex", "aegyptiacus", NA),
                    genus = c("Tyrannosaurus", "Spinosaurus", NA),
                    family = c("Tyrannosauridae", "Spinosauridae", "Diplodocidae"))
# Retain unique taxa
dinosaur_species <- tax_unique(occdf = occdf,
                               species = "species",
                               genus = "genus",
                               family = "family",
                               resolution = "species")
head(dinosaur_species)
##            family         genus           genus_species             unique_name
## 1   Spinosauridae   Spinosaurus Spinosaurus aegyptiacus Spinosaurus aegyptiacus
## 2 Tyrannosauridae Tyrannosaurus       Tyrannosaurus rex       Tyrannosaurus rex
## 3    Diplodocidae          <NA>                    <NA>     Diplodocidae indet.

Temporal range

Calculate and plot temporal range of taxa:

# Grab tetrapod data
occdf <- tetrapods
# Remove NAs
occdf <- subset(occdf, !is.na(order))
# Temporal range
ex <- tax_range_time(occdf = occdf, name = "order", plot = TRUE)

Geographic range

Four approaches to calculate geographic range of taxa:

# Grab internal data
occdf <- tetrapods
# Remove NAs
occdf <- subset(occdf, !is.na(genus))
# Convex hull
ex1 <- tax_range_space(occdf = occdf, name = "genus", method = "con")
# Latitudinal range
ex2 <- tax_range_space(occdf = occdf, name = "genus", method = "lat")
# Great Circle Distance
ex3 <- tax_range_space(occdf = occdf, name = "genus", method = "gcd")
# Occupied grid cells
ex4 <- tax_range_space(occdf = occdf, name = "genus", method = "occ", spacing = 250)
# See first few rows
head(ex2)
##                taxon taxon_id max_lat min_lat range_lat
## 1           Abajudon        1 -10.624 -16.524       5.9
## 2          Abdalodon        2 -31.925 -31.925       0.0
## 3        Abyssomedon        3  34.776  34.776       0.0
## 4   Acanthostomatops        4  51.000  51.000       0.0
## 5          Acerastea        5 -24.833 -24.833       0.0
## 6 Acerosodontosaurus        6 -24.000 -24.000       0.0

Temporal pseudo-occurrences

Convert range data to bin-level pseudo-occurrences:

# Generate example df
taxdf <- data.frame(name = c("A", "B", "C"),
                    max_age = c(150, 60, 30),
                    min_age = c(110, 20, 0))
# Generate pseudo-occurrences
ex1 <- tax_expand_time(taxdf = taxdf, max_ma = "max_age", min_ma = "min_age")
# Increase output width
options(width = 200)
# See first few rows
head(ex1)
##   name max_age min_age   ext  orig bin interval_name  rank max_ma mid_ma min_ma duration_myr abbr  colour  font
## 1    A     150     110 FALSE  TRUE  66     Tithonian stage  152.1 148.55  145.0          7.1   Ti #D9F1F7 black
## 2    A     150     110 FALSE FALSE  67    Berriasian stage  145.0 142.40  139.8          5.2   Be #8CCD60 black
## 3    A     150     110 FALSE FALSE  68   Valanginian stage  139.8 136.20  132.6          7.2   Va #99D36A black
## 4    A     150     110 FALSE FALSE  69   Hauterivian stage  132.6 131.00  129.4          3.2   Ha #A6D975 black
## 5    A     150     110 FALSE FALSE  70     Barremian stage  129.4 127.20  125.0          4.4 Barr #B3DF7F black
## 6    A     150     110 FALSE FALSE  71        Aptian stage  125.0 119.00  113.0         12.0   Ap #BFE48A black

Latitudinal pseudo-occurrences

Convert range data to bin-level pseudo-occurrences:

# Generate latitudinal bins
bins <- lat_bins()
# Generate example df
taxdf <- data.frame(name = c("A", "B", "C"),
                    max_lat = c(60, 20, -10),
                    min_lat = c(20, -40, -60))
# Generate pseudo-occurrences
ex1 <- tax_expand_lat(taxdf = taxdf, bins = bins)
# See first few rows
head(ex1)
##   name max_lat min_lat bin max mid min
## 1    A      60      20   4  60  55  50
## 2    A      60      20   5  50  45  40
## 3    A      60      20   6  40  35  30
## 4    A      60      20   7  30  25  20
## 5    B      20     -40   8  20  15  10
## 6    B      20     -40   9  10   5   0

Phylogeny wrangling

Compare a list of taxonomic names to tip names in a user-provided phylogeny:

# Read in example tree of ceratopsians from paleotree
library(paleotree)
data(RaiaCopesRule)
# Set smaller margins for plotting
par(mar = rep(0, 4))
# Plot tree
plot(ceratopsianTreeRaia)

# Specify list of names
dinosaurs <- c("Nasutoceratops_titusi", 
               "Diabloceratops_eatoni",
               "Zuniceratops_christopheri",
               "Psittacosaurus_major")

# Table of taxon names in list, tree or both
ex1 <- phylo_check(tree = ceratopsianTreeRaia,
                   list = dinosaurs)
# Get first few rows
head(ex1)
##                   taxon_name present_in_tree present_in_list
## 8      Diabloceratops_eatoni            TRUE            TRUE
## 33      Psittacosaurus_major            TRUE            TRUE
## 38     Nasutoceratops_titusi           FALSE            TRUE
## 39 Zuniceratops_christopheri           FALSE            TRUE
## 1       Centrosaurus_apertus            TRUE           FALSE
## 2  Styracosaurus_albertensis            TRUE           FALSE

Interval linking

Link and match interval names to the Geological Time Scale:

## Link numeric age values
# Create example df
occdf <- data.frame(name = c("A", "B", "C"),
                    early_interval = c("Maastrichtian",
                                       "Campanian",
                                       "Sinemurian"),
                    late_interval = c("Maastrichtian",
                                      "Campanian",
                                      "Bartonian"))
# Assign stages and numerical ages
occdf <- look_up(occdf)

## Use example int_key
# Get internal reef data
occdf <- reefs
# Get internal interval key
int_key <- interval_key
# Assign stages and numerical ages
occdf <- look_up(occdf,
                early_interval = "interval",
                late_interval = "interval",
                int_key = int_key)

Plotting

Add Geological Time Scale to plots:

# Plot data
plot(x = 541:0,
     xlab = "Time (Ma)", ylab = "User-variable",
     xlim = c(541, 0), xaxt = "n", type = "l", lwd = 5)

# Add Geological Time Scale
axis_geo(side = 1, intervals = "periods")

Wrapper

Run functions over groups of data:

# Get tetrapod data
occdf <- tetrapods

# Count number of occurrences from each country
ex1 <- group_apply(occdf = occdf, group = "cc", fun = nrow)

# Remove NA data
occdf <- subset(occdf, !is.na(genus))

# Unique genera per collection with group_apply and input arguments
ex2 <- group_apply(occdf = occdf,
                   group = c("collection_no"),
                   fun = tax_unique,
                   genus = "genus",
                   family = "family",
                   order = "order",
                   class = "class",
                   resolution = "genus")

# Use multiple variables (number of occurrences per collection & formation)
ex3 <- group_apply(occdf = occdf,
                   group = c("collection_no", "formation"),
                   fun = nrow)

What’s next?

Onwards and upwards 🏔️…

Palaeobiology CRAN Task View

Shiny App

Workshops and Hackathons

Further package development

Funding

Your involvement!

Thank-you / Merci / Gracias / Danke / Obrigado / Grazie / Ευχαριστώ

Website: General information
- https://palaeoverse.org
Twitter: News and updates
- @ThePalaeoverse
Google Group: A community space
- https://groups.google.com/g/palaeoverse

Point of contact: General contact
- LewisAlan.Jones@uvigo.es
Publication: Open-access
- https://doi.org/10.1111/2041-210X.14099