install.packages("palaeoverse")
Alpha diversity and geographic range size of Carboniferous and Permian tetrapods
Introduction
palaeoverse
is an R package developed by palaeobiologists, for palaeobiologists. The aim of the package is to generate a community-driven toolkit of generic functions for the palaeobiological community. The package does not provide implementations of statistical analyses, rather it provides auxiliary functions to help streamline analyses and improve code readability and reproducibility. This vignette (or tutorial if you prefer) is provided to guide you through the installation process and some of the functionality available within palaeoverse
. To do so, we will focus on a usage example looking at various trends in tetrapods across the Carboniferous/Permian boundary.
Installation
The palaeoverse
package can be installed via CRAN or its dedicated GitHub repository if the development version is preferred. To install via the CRAN, simply use:
To install the development version, first install the devtools
package, and then use install_github
to install palaeoverse
directly from GitHub.
install.packages("devtools")
::install_github("palaeoverse/palaeoverse") devtools
You can now load palaeoverse
using the default library
function:
library(palaeoverse)
Before we get onto the good stuff, the development team has a small request. If you use palaeoverse
in your research, please cite the associated publication. This will help us to continue our work in supporting you to do yours. You can access the appropriate citation via:
citation("palaeoverse")
To cite palaeoverse in publications, use the following citation:
Jones, L.A., Gearty, W., Allen, B.J., Eichenseer, K., Dean, C.D.,
Galván S., Kouvari, M., Godoy, P.L., Nicholl, C.S.C., Dillon, E.M.,
Flannery-Sutherland, J.T., Chiarenza, A.A. 2022. palaeoverse: A
community-driven R package to support palaeobiological analysis.
Methods in Ecology and Evolution. 00, 1--11. doi:
10.1111/2041-210X.14099
A BibTeX entry for LaTeX users is
@Article{,
title = {palaeoverse: A community-driven R package to support palaeobiological analysis},
author = {Lewis A. Jones and William Gearty and Bethany J. Allen and Kilian Eichenseer and Christopher D. Dean and Sofía Galván and Miranta Kouvari and Pedro L. Godoy and Cecily S. C. Nicholl and Lucas Buffan and Erin M. Dillon and Joseph T. Flannery-Sutherland and Alfio Alessandro Chiarenza},
year = {2023},
journal = {Methods in Ecology and Evolution},
pages = {1--11},
doi = {10.1111/2041-210X.14099},
}
Now that that’s out of the way, let’s begin!
The Carboniferous and early Permian have been regarded as critical intervals for early four-limbed vertebrates, in terms of both diversification and biogeography. During this time several major groups of tetrapods emerged (including crown amniotes), and the extensive tropical rainforests of the Carboniferous gave way to dryland vegetation during the Permian, in what’s known as the ‘Carboniferous Rainforest Collapse’ (CRC) (for more information, see Dunne et al., 2018).
In this vignette we’ll use the tetrapods
dataset to demonstrate how to use palaeoverse
functions to examine how alpha diversity and range size vary across this interval.
The Tetrapods Dataset
Functions in palaeoverse
are designed around the assumption that most users’ data are stored in an occurrence data.frame
. As such, this is the expected input data format for most functions. We provide two example occurrence datasets (tetrapods
and reefs
) within the package from different data sources. While these two datasets have similar structures, the data (variables) available in each differ. The tetrapods
dataset is a compilation of Carboniferous–Early Triassic tetrapod occurrences (n = 5,270) from the Paleobiology Database. We’ll be using this dataset to demonstrate some of the available functionality in palaeoverse
.
Let’s get started by exploring the dataset.
# Call the dataset:
data(tetrapods)
# View the column names and the first five lines of the first three columns:
colnames(tetrapods)
[1] "occurrence_no" "collection_no" "identified_name"
[4] "identified_rank" "accepted_name" "accepted_rank"
[7] "early_interval" "late_interval" "max_ma"
[10] "min_ma" "phylum" "class"
[13] "order" "family" "genus"
[16] "abund_value" "abund_unit" "lng"
[19] "lat" "collection_name" "cc"
[22] "formation" "stratgroup" "member"
[25] "zone" "lithology1" "environment"
[28] "pres_mode" "taxon_environment" "motility"
[31] "life_habit" "diet"
1:5, 1:3] tetrapods[
occurrence_no collection_no identified_name
1 146665 12943 n. gen. Anconastes n. sp. vesperus
2 147090 13004 Gorgonopsia indet.
3 147521 13043 n. gen. Procolophon n. sp. trigoniceps
4 147523 13044 n. gen. Ophiodeirus n. sp. casei
5 147563 13046 Broiliellus n. sp. arroyoensis
As you can see, we have a variety of variables to play around with! For this vignette we’ll mostly be focusing on some of the spatial information, such as latitude and longitude. But first, we need to do some data prep…
Exploring the functions
If we’re looking at patterns of tetrapod diversity and distribution over the Carboniferous and Permian, we’ll first have to bin our occurrences by time and by space. We can do this easily with a few functions in the palaeoverse
package. Let’s first look at binning by time.
There are two palaeoverse
functions that help us to assign occurrences to different intervals, or ‘bins’, of time. First, we’ll use time_bins
to make some suitable time bins for our study. The function requires a maximum and a minimum time interval to run. As we know we’re interested in looking at the Carboniferous and the Permian, we can use these as our intervals to get the suitable stage-level bins that we’re after:
# Generate time bins
<- time_bins(interval = c("Carboniferous", "Permian"),
bins rank = "stage",
scale = "GTS2012",
plot = FALSE)
head(bins)
bin interval_name rank max_ma mid_ma min_ma duration_myr abbr colour font
1 1 Tournaisian stage 358.9 352.80 346.7 12.2 T #8CB06C black
2 2 Visean stage 346.7 338.80 330.9 15.8 V #A6B96C black
3 3 Serpukhovian stage 330.9 327.05 323.2 7.7 Se #BFC26B black
4 4 Bashkirian stage 323.2 319.20 315.2 8.0 B #99C2B5 black
5 5 Moscovian stage 315.2 311.10 307.0 8.2 M #B3CBB9 black
6 6 Kasimovian stage 307.0 305.35 303.7 3.3 Ka #BFD0C5 black
Before we can bin our occurrences into these time bins, we will need the ages of the occurrences in millions of years. Since the numeric ages may have changed since we downloaded the data from the PBDB, we’ll use look_up
to get the numeric ages for the occurrences within tetrapods
dataset based on their assigned interval names:
# Get new numeric ages for named intervals using the
# interval key that is supplied with Palaeoverse
<- look_up(tetrapods, int_key = interval_key)
tetrapods
# Make sure that any values which could not be matched contain their original values
$interval_max_ma <- ifelse(is.na(tetrapods$interval_max_ma),
tetrapods$max_ma,
tetrapods$interval_max_ma)
tetrapods$interval_min_ma <- ifelse(is.na(tetrapods$interval_min_ma),
tetrapods$min_ma,
tetrapods$interval_min_ma)
tetrapods$interval_mid_ma <- (tetrapods$min_ma + tetrapods$max_ma)/2
tetrapods
# Look at the updated dataset
1:5, 35:37] tetrapods[
interval_max_ma interval_mid_ma interval_min_ma
1 307.000 302.400 298.90
2 265.100 262.500 259.10
3 251.902 249.685 247.20
4 290.100 284.700 283.50
5 283.500 275.800 272.95
As you can see we’ve got some new columns with updated interval ages. Handy!
Now we’re ready to bin the occurrences into these time bins using bin_time
. For now, we’ll go with the ‘mid’ method, but we’ll talk about this more in a minute.
# Rename columns so that new intervals are used for binning
colnames(tetrapods)[9:10] <- c("old_max_ma", "old_min_ma")
colnames(tetrapods)[c(35, 37)] <- c("max_ma", "min_ma")
# Generate time bins
<- bin_time(occdf = tetrapods,
tetrapods bins = bins,
method = 'mid')
Error in bin_time(occdf = tetrapods, bins = bins, method = "mid"): Minimum age of occurrence data is less than minimum age of bins.
Oh no! We’ve hit an error. This is because there’s some occurrences that sit outside of the time intervals we’re interested in. bin_time
can’t bin occurrences which sit outside of the time intervals you’ve chosen, so for it to work, we’ll first have to remove these occurrences.
# Remove occurrences that are younger than the time intervals we're interested in
<- subset(tetrapods, min_ma > min(bins$min_ma))
cp_tetrapods
# Bin occurrences into chosen time bins
<- bin_time(occdf = cp_tetrapods,
mid_tetrapods bins = bins,
method = 'mid')
Warning in bin_time(occdf = cp_tetrapods, bins = bins, method = "mid"): One or
more occurrences have a midpoint age equivalent to a bin boundary. Binning
skipped for these occurrences. Hint: `which(is.na(occdf$bin_assignment))`.
That’s better! Let’s see how our dataset has changed.
# Examine the dataset columns
colnames(mid_tetrapods)
[1] "occurrence_no" "collection_no" "identified_name"
[4] "identified_rank" "accepted_name" "accepted_rank"
[7] "early_interval" "late_interval" "old_max_ma"
[10] "old_min_ma" "phylum" "class"
[13] "order" "family" "genus"
[16] "abund_value" "abund_unit" "lng"
[19] "lat" "collection_name" "cc"
[22] "formation" "stratgroup" "member"
[25] "zone" "lithology1" "environment"
[28] "pres_mode" "taxon_environment" "motility"
[31] "life_habit" "diet" "early_stage"
[34] "late_stage" "max_ma" "interval_mid_ma"
[37] "min_ma" "id" "n_bins"
[40] "bin_assignment" "bin_midpoint"
As you can see, four new columns (variables) have been added to our dataset. Two of these are especially important. First is the bin_assignment
, which shows which number bin the occurrence has been added to. Second is n_bins
- this shows how many bins the occurrence could potentially have been placed within. Let’s examine this further.
# Make a table of potential number of bins
table(mid_tetrapods$n_bins) # raw counts
1 2 3 4 5 6 7 8 9
843 1754 425 268 33 34 3 2 16
table(mid_tetrapods$n_bins) / nrow(mid_tetrapods) # proportions
1 2 3 4 5 6
0.2495559503 0.5192421551 0.1258140912 0.0793368857 0.0097690941 0.0100651273
7 8 9
0.0008880995 0.0005920663 0.0047365305
As you can see, only ~25% of the dataset can be assigned to just one bin - the other occurrences can’t be confidently assigned to an individual time interval. This is where the other methods for the bin_time
function come in handy. They provide the user with different options when assigning occurrences to bins. Instead of the ‘mid’ method, let’s try the ‘majority’ method instead, which places the occurrences in the bin which covers the majority of their potential time range.
# Bin occurrences into chosen time bins
<- bin_time(occdf = cp_tetrapods,
maj_tetrapods bins = bins,
method = 'majority')
We’d recommend playing around with these different options and seeing how they impact your results!
Next, we’ll also bin our occurrences spatially. We need a couple of functions to do this. Firstly, we need to add palaeocoordinates to our dataset. We can use the palaeorotate
function to do this.
# Palaeorotate occurrences
<- palaeorotate(occdf = maj_tetrapods, age = "bin_midpoint",
maj_tetrapods method = "point", model = "PALEOMAP")
Now let’s check that it’s worked…
# Check palaeocoordinates
head(maj_tetrapods[, c("p_lng", "p_lat")])
p_lng p_lat
1 -39.1618 2.9214
2 -34.0923 -59.1326
4 -29.7188 -0.1250
5 -27.9228 3.6845
6 -29.0469 -0.2730
7 -16.5579 -6.0513
Looks good to me! Now that we have palaeocoordinates for our occurrences, we can put them into spatial bins. We can do this using the bin_space
function, which assigns each occurrence to one of many discrete equal-area hexagonal grid cells (based on Uber’s H3 system).
# Make a table of potential number of bins
<- bin_space(occdf = maj_tetrapods,
maj_tetrapods lng = 'p_lng',
lat = 'p_lat',
spacing = 100)
Average spacing between adjacent cells in the primary grid was set to 103.6 km.
H3 resolution: 3
# Show the first 6 rows of the new columns added using `bin_space`.
head(maj_tetrapods[, c("cell_ID", "cell_centroid_lng", "cell_centroid_lat")])
cell_ID cell_centroid_lng cell_centroid_lat
1 8380d4fffffffff -38.83199 3.04111345
2 83dd68fffffffff -34.66666 -59.09848198
4 837c65fffffffff -29.87643 -0.04859754
5 837c40fffffffff -27.94272 3.31086571
6 837c64fffffffff -28.90531 -0.63974717
7 837da3fffffffff -16.16673 -6.15422544
bin_space
adds some new columns, including one with the unique grid cell ID, as well as latitudinal and longitudinal coordinates for the centroid of that cell.
Now that we have both temporal and spatial bins, we can use these data to see how the spatial coverage of tetrapod fossils in our dataset varies through time.
# Extract unique interval midpoints
<- sort(unique(maj_tetrapods$bin_midpoint))
midpoints
# Calculate the number of unique cells in each time bin
<- unique(maj_tetrapods[, c("bin_midpoint", "cell_ID")])
unique_cells <- group_apply(unique_cells, group = "bin_midpoint", fun = nrow) spat.cov
Let’s save that for now and move on to some of our analyses!
Unique taxa and alpha diversity
In many cases palaeobiologists count unique taxa by retaining only unique occurrences identified to a given taxonomic resolution (e.g. genera or species); however, this can result in some useful information being discarded. The tax_unique
function retains occurrences identified to a coarser taxonomic resolution which are not already represented within the dataset. Take, for example, the following occurrences:
- Albertosaurus sarcophagus
- Ankylosaurus sp.
- Ceratopsidae indet.
- Ornithominus sp.
- Tyrannosaurus rex
A filter for species-level identifications would reduce the species richness to just two (A. sarcophagus and T. rex). However, none of these clades are nested within one another, so each of the indeterminately identified occurrences represents at least one species not already represented in the dataset. tax_unique
is designed to deal with such taxonomic data and would retain all seven ‘species’ in this example. For further details see ?tax_unique
.
This function can be especially useful for when looking at alpha diversity (local richness). Let’s apply it to our dataset by filtering it to genus-level.
# Filter to genus-level:
<- tax_unique(occdf = maj_tetrapods, genus = "genus", family = "family",
tet_genera order = "order", class = "class", resolution = "genus")
We can compare this to the genera from the original dataset to see how many genera were added by using tax_unique()
:
# How many unique genera before:
length(unique(maj_tetrapods$genus))
[1] 717
# How many unique genera after:
length(unique(tet_genera$unique_name))
[1] 719
As you can see we’ve added two taxa that otherwise would have been ignored! To see more specifically how the function has performed, take a look at row 718, which records an indeterminate occurrence of the family Dicynodontidae. This occurrence now has a unique name of ‘Dicynodontidae indet.’
# See just row 718
718, ] tet_genera[
class order family genus unique_name
718 Osteichthyes Dicynodontia Dicynodontidae <NA> Dicynodontidae indet.
Now, let’s use this function to extract the unique genera for each collection (=locality) with group_apply
:
# Use group_apply along with tax_unique:
<- group_apply(occdf = maj_tetrapods,
coll_genera group = c("collection_no"),
fun = tax_unique,
genus = "genus",
family = "family",
order = "order",
class = "class",
resolution = "genus")
# See structure of the new dataframe:
str(coll_genera)
'data.frame': 2853 obs. of 6 variables:
$ class : chr "Osteichthyes" "Osteichthyes" "Osteichthyes" "Reptilia" ...
$ order : chr NA "Temnospondyli" "Cotylosauria" NA ...
$ family : chr NA "Dissorophidae" NA "Bolosauridae" ...
$ genus : chr NA "Parioxys" "Archeria" "Bolosaurus" ...
$ unique_name : chr "Osteichthyes indet." "Parioxys sp." "Archeria sp." "Bolosaurus sp." ...
$ collection_no: chr "11149" "11152" "11165" "11165" ...
Next, let’s count the number of unique genera per collection:
# Get the names of unique genera per collection
<- unique(coll_genera[, c("unique_name", "collection_no")])
unique_genera
# Calculate the number of unique genera per collection
<- group_apply(unique_genera, group = "collection_no", fun = nrow)
coll_taxa
# Rename column names:
colnames(coll_taxa) <- c("n_taxa", "collection_no")
# Take a quick look at the new dataframe
head(coll_taxa)
n_taxa collection_no
1 1 104796
2 1 106270
3 3 106558
4 8 108766
5 3 109702
6 1 11149
To plot an alpha diversity (local richness) plot, we will need to add interval age data to these collections:
# Take the columns pertaining to collections and their ages in millions of years:
<- maj_tetrapods[, c("collection_no", "max_ma", "interval_mid_ma", "min_ma")]
coll_info
# Remove duplicated collections based on the collection number (column 1)
<- coll_info[!duplicated(coll_info[1]),]
coll_info
# Combine this dataframe with the dataframe from above
<- merge(coll_info, coll_taxa, by = "collection_no")
alpha_data
# Take a look:
head(alpha_data)
collection_no max_ma interval_mid_ma min_ma n_taxa
1 11149 290.10 284.70 283.50 1
2 11152 293.52 283.90 272.95 1
3 11165 298.90 294.50 290.10 9
4 11323 303.70 301.15 293.52 1
5 12943 307.00 302.40 298.90 1
6 12976 283.50 275.80 272.95 7
Done. Now to plot alpha diversity through time! Let’s start with a simple scatterplot:
# Make a plot for alpha diversity in the Carboniferous and Permian
plot(alpha_data$interval_mid_ma, alpha_data$n_taxa, # add points
xlim = rev(range(alpha_data$interval_mid_ma, na.rm = TRUE)), # reverse x-axis
xlab="Time (Ma)", ylab="No. taxa", # axes labels
pch = 19, col = "#0e826f") # point style and colour
That looks good, but it’s a bit difficult to figure out what stages contain which data points. axis_geo()
adds a geological timescale between the plot and the axis.
Taking our plot from above, let’s use this function to add a timescale to the x-axis:
# Set margins
par(mar = c(7.1, 4.1, 4.1, 2.1)) # expand at bottom if adding more interval categories
# Make the final alpha diversity plot
plot(alpha_data$interval_mid_ma, alpha_data$n_taxa, # add the points
axes = FALSE,
xlim = rev(range(alpha_data$interval_mid_ma, na.rm = TRUE)), # reverse the x-axis
xlab = " ", ylab = "No. taxa", # axes labels
pch = 19, col = "#0e826f") # point style and colour
box()
axis(2) # add a normal axis to the elft side
axis_geo(side = 1, intervals = list("epochs", "periods"))
title(xlab = "Time (Ma)", line = 4)
Amazing! Now we can clearly see that alpha diversity appears to be a lot higher in the Permian than the Carboniferous. But how has the spatial distribution of tetrapods changed across this boundary? Let’s use some of the other palaeoverse
functions to find out!
Exploring range size/spatial issues through time
Another factor that we can examine is how range size of taxa changed across the C/P boundary. We can examine this spatial distribution through time using the tax_range_space
function, which provides several different methods for estimating the distribution of taxonomic occurrences. For now we’ll use the method con
which calculates the geographic range of taxa using a convex hull, but the rest of the methods are described in detail in the function documentation which can be accessed via ?tax_range_space
.
We can use this in combination with group_apply
to calculate the average for each time interval for our dataset:
# First, remove any occurrences without a genus
<- subset(maj_tetrapods, !is.na(genus))
space_tetrapods
# Find temporal range of all genera
<- group_apply(occdf = space_tetrapods,
space_tetrapods group = c("bin_midpoint"),
fun = tax_range_space,
name = "genus",
lng = "p_lng",
lat = "p_lat",
method = "con")
# Have a look at the dataset
head(space_tetrapods)
taxon taxon_id area bin_midpoint
1 Abdalodon 1 0.00 257.05
2 Acrodenta 2 0.00 257.05
3 Adelosaurus 3 0.00 257.05
4 Aelurognathus 4 292392.77 257.05
9 Aelurosaurus 5 14698.41 257.05
14 Aenigmastropheus 6 0.00 257.05
As you can see, the combination of group_apply
and tax_range_space
produces a dataset with the area of geographic range for each taxon in kilometers squared for each time interval. Note that an area of zero here means that there is only one occurrence of that taxon in that time bin. From this information we can now plot how the average range size of tetrapods changed throughout the Carboniferous and Permian:
# Find the average geographic range size for each time interval
<- group_apply(space_tetrapods, "bin_midpoint", function(df) mean(df$area))
space_tetrapods_mean colnames(space_tetrapods_mean) <- c("mean_area", "bin_midpoint")
# Create a plot of average range size through time
plot(space_tetrapods_mean$bin_midpoint, space_tetrapods_mean$mean_area, # add the points
axes = FALSE,
xlim = rev(range(as.numeric(space_tetrapods_mean$bin_midpoint), na.rm = TRUE)), # reverse the x-axis
xlab = " ", ylab = expression("Average range size of tetrapod genera (km"^2~")"), # axes labels
pch = 19, col = "#0e826f") # point style and colour
box()
axis(2) # add a normal axis to the left side
axis_geo(side = 1, intervals = list("epochs", "periods"))
title(xlab = "Time (Ma)", line = 4)
From this plot it seems like the average geographic range size of tetrapods increased during the Permian. However, we first need to be sure we’re not picking up on another pattern. Remember how we calculated the number of unique geographic cells in each time interval earlier? This provides a measure of the spatial spread of the fossil record. It might be that we’re not picking up on an increased range size but just a larger geographic spread of fossil bearing rocks within the Permian. To see if this is the case, we can run a simple correlation test between the average geographic range size of tetrapods and the number of unique grid cells in each time interval:
# Run a correlation test between mean range size and the spatial spread of cells
# through time
cor.test(log10(space_tetrapods_mean$mean_area), log10(spat.cov$nrow), method = "spearman")
Spearman's rank correlation rho
data: log10(space_tetrapods_mean$mean_area) and log10(spat.cov$nrow)
S = 226.12, p-value = 0.01898
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5962179
The high rho value and a p-value less than our cut off for significance (0.05) indicate a strong and statistically significant correlation between average geographic range size and the spatial spread of data. Seems like this work is far from over…
…but we’ll leave the rest to you!
Hopefully this vignette has shown you the potential uses for palaeoverse
functions and helped provide a workflow for your own analyses. If you have any questions about the package or its functionality, please feel free to join our Palaeoverse Google group and leave a question; we’ll aim to answer it as soon as possible!
References
Dunne, E.M., Close, R.A., Button, D.J., Brocklehurst, N., Cashmore, D.D., Lloyd, G.T. and Butler, R.J., 2018. Diversity change during the rise of tetrapods and the impact of the ‘Carboniferous rainforest collapse’. Proceedings of the Royal Society B: Biological Sciences, 285(1872), 20172730.