Setophaga example

Philip Mostert

2024-02-02

Introduction

Predicting the distribution of species across space and time is a fundamental piece for answering a multitude of questions surrounding the field of ecology. Consequentially, numerous statistical tools have been developed to assist ecologists in making these types of predictions; the most common of which is certainly species distribution models (SDMs).

These types of models are heavily reliant on the quality and quantity of data available across multiple time periods in order to produce meaningful results. Thankfully, the \(21^{st}\) century has been associated with a boom in digital technology and the collapse of data storage costs achieved, allowing us to see data on the environment and species occurrences grow at unprecedented rates. The problem however is that all these data are disparate in nature; they all have their own sampling protocols, assumptions and attributes, making joint analysis of them difficult.

On account of this issue, the field of integrated species distribution modelling has progressed considerably over the last few years – and now the associated methods are well-developed, and the benefits of using such models (over single dataset or simple data-pooling models) are apparent.

A major handicap to integrated modeling lies in the fact that the tools required to make inference with these models have not been established – especially with regards to general software to be used by ecologists. This in turn has stagnated the growth of applying these models to real data.

In light of this impediment to the field of integrated modelling, we decided to make PointedSDMs, an R package to help simplify the modelling process of integrated species distribution models (ISDMs) for ecologists. It does so by using the INLA framework (Rue, Martino, and Chopin 2009) to approximate the models and by constructing wrapper functions around those provided in the R package, inlabru (Bachl et al. 2019).

This R markdown file illustrates an example of modelling an ISDM with PointedSDMs, using five disparate datasets containing species Setophaga collected around Pennsylvania state (United States of America). The first part of this file contains a brief introduction to the statistical model used in ISDMs, as well as an introduction to the included functions in the package. The second part of this file uses PointedSDMs to run the ISDM for the provided data. We note that, for this particular vignette, no inference was made due to computational intensity of the model. However the R script and data are provided below so that the user may carry out inference.


##Install if need be
library(PointedSDMs)

Statistical model

The goal of our statistical model is to infer the “true” distribution of our species’ using the observations we have at hand, as well as environmental covariates describing the studied area, \(\boldsymbol{X}\), and other parameters describing the species, \(\boldsymbol{\theta}\). To do so, we construct a hierarchical state-space model composing of two parts: the underlying process model as well as observation models.

The process model is a stochastic process which describes how the points are distributed in space. The most commonly assumed process model is the Log-Gaussian Cox process (LGCP), which is characterized by the intensity function \(\lambda(s)\): such that a higher a higher intensity at a location implies the species is more abundant there, as well as a Gaussian random field used to account for all the environmental covariates not included in the model.

The observation models give a statistical description of the data collection process, and are chosen for each dataset based on their underlying sampling protocols (see (Isaac et al. 2020) for a detailed review of the types of observation models typically used in integrated models). As a result, we can write the likelihood of the observation model for a given dataset, \(Y_i\) as \(\mathcal{L} \left( Y_i \mid \lambda(s), \theta_i \right)\).

Therefore, given a collection of heterogeneous datasets \(\{Y_1, Y_2,...,Y_n\}\), the full-likelihood of the statistical process is given by:\[\mathcal{L} \left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\theta}, \boldsymbol{\phi} \right) = p\left( \lambda(s), \boldsymbol{X}, \boldsymbol{\phi} \right) \times \prod_{i=1}^n \mathcal{L} \left(Y_i \mid \lambda(s), \theta_i\right),\]where \(\boldsymbol\phi\) are parameters assumed for the underlying process.

Package use

PointedSDMs is designed to simplify the modelling process of integrated species distribution models by using wrapper functions around those found in R-INLA and inlabru. The simplification is regarding four key steps in the statistical-modelling process:

The first function of interest with regards to these steps is intModel, which is designed to create objects required by inlabru, assign the relevant information to individual objects used in model specification, as well as structure the provided datasets together into a homogenized framework.


args(intModel)

The output of this function is an R6 object, and so there are several slot functions included to help specify the model at a finer scale. Use ?dataSDM to get a comprehensive description of these slot functions.

The next function used is fitISDM which takes the data object created with intModel and runs the integrated model. The output of this function is in essence an inlabru model output with additional information to be used by the sequential functions.


args(fitISDM)

Spatial blocked cross-validation may be performed using the blockedCV function – which iteratively calculates and averages the deviance information criteria (DIC) for models run without data from the selected spatial block. Before this function is used, .$spatialBlock() from the object produced by intModel is required in order to specify how the data should be spatially blocked (see below for an example).


args(blockedCV)

datasetOut provides cross-validation for integrated models by running the full model with one less dataset, and calculating a score measuring the relative importance of the left out dataset in the full model.


args(datasetOut)

Finally prediction of the full model may be completed using the predict function; the arguments here are mostly related to specifying which components are required to be predicted, however the function can act identically to the inlabru predict function if need be. After predictions are complete, plots may be made using the generic plot function.

Setophaga example

This example aims to predict the distribution of three species of genus setophaga across Pennsylvania state. This example is notable in integrated modelling since it has been used by two seminal papers in the field, namely those by: Isaac et al. (2020) and Miller et al. (2019). This file extends the example by adding two additional datasets containing a further two species.

Model preparations

The first step in our analysis is to load in the packages required.


library(INLA)
library(inlabru)
library(USAboundaries)
library(sp)
library(sf)
library(blockCV)
library(ggmap)
library(sn)
library(terra)
library(RColorBrewer)
library(cowplot)
library(knitr)
library(kableExtra)
library(dplyr)
library(spocc)

Finally, additional objects required by PointedSDMs and the R-INLA (Martins et al. 2013) and inlabru (Bachl et al. 2019) packages need to be assembled.

A SpatialPolygons object of Pennsylvania (obtained using the USAboundaries (rOpenSci 2018) package) is the first of these objects required, and it will be used to construct an inla.mesh object as well as help with the plotting of graphics. In order to create one of these polygons objects, we are first required to define the projection reference system (see ?CRS ).


proj <- '+proj=longlat +datum=WGS84 +no_defs'
PA <- USAboundaries::us_states(states = "Pennsylvania")
#PA <- PA$geometry
PA <- as(PA, "sf")
st_crs(PA) <- proj

Species occurrence data

For our analysis, we used five datasets containing the geocoded locations of our studied species’. The first three of these datasets are present only datasets obtained via eBird through the spocc package (Chamberlain 2021), which is a toolkit used to obtain species’ observation data from a selection of popular online data repositories (GBIF, Vernet, BISON, iNaturalist and eBird). These data were obtained using the following script:


species <- c('caerulescens', 'fusca', 'magnolia')

dataSets <- list()
for (bird in species) {
  
  raw_data <- spocc::occ(
              query = paste('Setophaga', bird),
              from = "gbif",
              date = c("2005-01-01", "2005-12-31"),
              geometry = PA)$gbif
  
  rows <- grep("EBIRD", raw_data$data[[paste0('Setophaga_', bird)]]$collectionCode)

  raw_data <- data.frame(raw_data$data[[1]][rows, ])
  raw_data$Species_name <- rep(bird, nrow(raw_data))
  
  data_sp <- st_as_sf(
    x = raw_data[, names(raw_data) %in% c("longitude", "latitude", 'Species_name')],
    coords = c('longitude', 'latitude'),
    crs = proj)

  dataSets[[paste0('eBird_', bird)]] <- data_sp[unlist(st_intersects(PA, data_sp)),]
  
  }

The PointedSDMs package also contains two additional structured datasets (BBA, BBS), which were both assumed to be presence absence datasets in the study conducted by Isaac et al. (2020), containing the variable, NPres, denoting the presence (or number of presences in the BBS dataset) at each sampling location. However, we changed the NPres variable name of the BBS dataset to Counts in order to consider it a counts dataset for illustrative purposes. No additional changes were made to the BBA dataset, and so it was considered a presence absence dataset as originally intended. These datasets may be loaded using the following script:


data('SetophagaData')
dataSets[['BBA']] <- SetophagaData$BBA
dataSets[['BBS']] <- SetophagaData$BBS
Dataset name Sampling protocol Number of observations Species name Source
BBS Counts 45 Caerulescens North American Breeding Bird Survey
BBA Detection/nondetection 5165 Caerulescens Pennsylvania Breeding Bird Atlas
eBird_caerulescens Present only 264 Caerulescens eBird
eBird_magnolia Present only 354 Magnolia eBird
eBird_fusca Present only 217 Fusca eBird

Table 1: Table illustrating the different datasets used in the analysis.

Covariate data

Species distribution models study the relationship between our in-situ species observations and the underlying environment, and so in line with the study conducted by Isaac et al. (2020), we considered two covariates: elevation, describing the height above sea level (meters) and canopy, describing the proportion of tree canopy covered in the area. These two covariates were obtained from the Hollister and Shah (2017) and Bocinsky et al. (2019) R packages respectively. The script to obtain these spatial covariates is provided in the same github repository as the species occurrence data above; the only thing we did differently was convert the covariates to a spatRaster object.


covariates <- scale(terra::rast(system.file('extdata/SetophagaCovariates.tif', 
                                      package = "PointedSDMs")))
names(covariates) <- c('elevation', 'canopy')
crs(covariates) <- proj

Additional objects

The R-INLA package requires a Delaunay triangulated mesh used to approximate our spatial random fields, which is created by supplying the max.edge, offset and cutoff arguments as well as our SpatialPolygons map of Pennsylvania to the inla.mesh.2d function. With this mesh, intModel will create integration points required by our model using inlabru’s fm_int function.


mesh <- inla.mesh.2d(boundary = inla.sp2segment(PA), 
                     cutoff = 0.2,
                     max.edge = c(0.1, 0.24), 
                     offset = c(0.1, 0.4),
                     crs = st_crs(proj))

mesh_plot <- ggplot() +
             gg(mesh) +
             ggtitle('Plot of mesh') +
             theme_bw() +
             theme(plot.title = element_text(hjust = 0.5))
mesh_plot
intModel

The intModel function is the first step in the modeling of our species data. The aim of the function is to organize and standardize our data to ensure that each process is correctly modeled in the integrated framework. For this standardization to work, we specify the response variable names of the counts data (responseCounts) and present absence (responsePA), the coordinate names (Coordinates) as well as the coordinate reference system used (Projection). In addition, the species variable name (speciesName) was also specified in order to model the effects for the species found in the datasets.


spatial_data <- intModel(dataSets,
                        Coordinates = c('X', 'Y'), 
                        Projection = proj, Mesh = mesh,
                        responsePA = 'NPres', responseCounts = 'Counts',
                        spatialCovariates = covariates, speciesName = 'Species_name')

As mentioned above, there are some slot functions now inside the spatial_data objects, which allow for a finer level of customization of the components for the integrated model.

.$plot

Let’s have a look at a graphical representation of the species’ distribution, using the .$plot() function within spatial_data.


spatial_data$plot(Boundary = FALSE) +
  geom_sf(data = PA, fill = 'black', alpha = 0.15, lwd = 1.2) +
  theme_bw() +
  ggtitle('Plot of the datasets') + 
  theme(plot.title = element_text(hjust = 0.5))

This will create a plot of the location of the species based on their dataset, however if we wanted to plot by species, we could specify this with the Species = TRUE argument.


spatial_data$plot(Species = TRUE, Boundary = FALSE) +
  geom_sf(data = PA, fill = 'black', alpha = 0.15, lwd = 1.2) +
  theme_bw() +
  ggtitle('Plot of the species') + 
  theme(plot.title = element_text(hjust = 0.5))
.$specifySpatial

When conducting Bayesian analysis, the choice of prior distribution is imperative. In our example, we chose Penalizing complexity priors (Simpson et al. 2017), which are designed to control the spatial range and marginal standard deviation in the GRF’s Matérn covariance function in order to reduce over-fitting in the model. This can be done with ease using the .$specifySpatial function, which will also update all the components associated with this field automatically.


spatial_data$specifySpatial(sharedSpatial = TRUE,
                            prior.sigma = c(5, 0.01),
                            prior.range = c(1, 0.01))
.$addBias

Presence only datasets (such as our eBird data) are often noted to contain numerous biases, which in turn can heavily skew results obtained from our model. A common approach to account for these biases is by using a variable in the model such as sampling effort (some examples include: distance traveled, number of observations per visit, duration spent observing). However in the absence of such variables, running a second spatial field in the model can improve performance of the integrated model (Simmonds et al. 2020). To add a second spatial field to our model, we use the .$addBias function, which is part of the intModel object created above.


spatial_data$addBias('eBird_caerulescens')
spatial_data$addBias('eBird_fusca')
spatial_data$addBias('eBird_magnolia')
.$priorsFixed

Suppose we knew a priori what the mean and precision values of one of the fixed terms for a given species was: we could add this knowledge to the model using the .$priorsFixed function.


spatial_data$priorsFixed(Effect = 'elevation', Species = 'fusca',
                         mean.linear = 2, prec.linear = 0.05)
.$changeComponents

If we would like to see what the components required by inlabru in our model are, we can use the .$changeComponents function with no arguments specified.


spatial_data$changeComponents()

This model is thus specified with: two environmental covariates, an intercept term per species, a spatial field per species, the three bias fields for the presence only data as well as a shared spatial field between the different datasets. If we wanted to add/change or remove a component entirely, this could be done with the addComponent and removeComponent argument respectively.

.$spatialBlock

.$spatialBlock is used to set up spatial blocked cross-validation for the model by assigning each point in the datasets a block based on where the point is located spatially. For this example, we chose four blocks (k=4) for our model, based around a 2x2 grid (rows = 2, cols = 2). See the figure below for an illustration of the spatial block: the amount of data within each block appears reasonably equal.


spatial_data$spatialBlock(k = 4, rows_cols = c(2, 2), plot = TRUE) + theme_bw()
blockedCV

Spatial blocked cross-validation may then be produced using the blockedCV function. We specify control.inla = list(int.strategy = 'eb') in our options argument in order to speed up the model running time.


spatialBlocked <- blockedCV(data = spatial_data, options = list(control.inla = list(int.strategy = 'eb')))

spatialBlocked

More so, we can compare the cross-validation score from this model to one without the shared spatial effect (specified with pointsSpatial = FALSE in intModel).


no_fields <- intModel(dataSets,
                      Coordinates = c('X', 'Y'), 
                      pointsSpatial = NULL,
                      Projection = proj, Mesh = mesh,
                      responsePA = 'NPres', responseCounts = 'Counts',
                      spatialCovariates = covariates, speciesName = 'Species_name')

no_fields$spatialBlock(k = 4, rows = 2, cols = 2)

spatialBlocked_no_fields <- blockedCV(data = no_fields, options = list(control.inla = list(int.strategy = 'eb')))

spatialBlocked_no_fields

Based on the DIC scores, we conclude that the model with the shared spatial field provides a better fit of the data.

fitISDM

The integrated model is easily run with the fitISDM function as seen below by supplying the occurrence and R-INLA objects (created above with intModel).


joint_model <- fitISDM(data = spatial_data, 
                       options = list(control.inla = list(int.strategy = 'eb')))

results_plot <- joint_model$summary.fixed %>%
                mutate(species = gsub('_.*$','',
                                      row.names(joint_model$summary.fixed))) %>%
                mutate(coefficient = row.names(joint_model$summary.fixed))


coefficient_plot <- ggplot(results_plot, aes(x = coefficient, y = mean)) +
                    geom_hline(yintercept = 0, colour = grey(0.25), lty = 2) +
                    geom_point(aes(x = coefficient, 
                                   y = mean)) + 
                    geom_linerange(aes(x = coefficient, 
                                       ymin = `0.025quant`,
                                       ymax = `0.975quant`,
                                       col = species),
                                       lwd = 1) +
                                       theme_bw() +
                    scale_colour_manual(values = c('#003f5c', '#bc5090','#ffa600')) +
                    theme(legend.position="bottom",
                    plot.title = element_text(hjust = 0.5)) +
                    ggtitle("95% credibility intervals of the fixed effects\n
                            for the three studied species") +
                    labs(x = 'Variable', y = 'Coefficient value') +
                    coord_flip()

coefficient_plot
datasetOut

PointedSDMs also includes the function datasetOut, which iteratively omits one dataset (and its associated marks) out of the full model. For example, we can calculate the effect on the other variables by leaving out the dataset BBA, which contributes the most occurrences used in our joint-likelihood model by a significant amount.

Furthermore, by setting predictions = TRUE we are able to calculate some cross-validation score by leaving out the selected dataset. This score is calculated by predicting the covariate values of the left-out using the reduced model (i.e the model with the dataset left out), using the predicted values as an offset in a new model, and then finding the difference between the marginal-likelihood of the full model (i.e the model with all the datasets considered) and the marginal-likelihood of the offset model.


dataset_out <- datasetOut(model = joint_model,
                          dataset = "BBA",
                          predictions = TRUE)

dataset_out
predict

A significant part of SDMs is creating prediction maps to understand the species’ spread. Predictions of the integrated SDM’s from fitISDM are made easy using the predict function. By supplying the relevant components, the predict function will create the formula required by inlabru’s predict function to make predictive maps. In this example, we made predictions for the spatial fields of the species, on the linear scale using 1000 samples.


projections <- predict(joint_model, mesh = mesh, mask = PA, 
                       spatial = TRUE,
                       fun = 'linear', n.samples = 1000)
plot

PointedSDMs also provides a general plotting function to plot the maps using the predicted values obtained in the predict function. The plot below shows the predictions of the three spatial fields of our species.


plot(projections, whattoplot = 'mean', 
     colourLow = 'orange', colourHigh = 'dark red')

References

Bachl, Fabian E, Finn Lindgren, David L Borchers, and Janine B Illian. 2019. “Inlabru: An r Package for Bayesian Spatial Modelling from Ecological Survey Data.” Methods in Ecology and Evolution 10 (6): 760–66. https://doi.org/10.1111/2041-210X.13168.
Bocinsky, R Kyle, Dylan Beaudette, Scott Chamberlain, and Maintainer R Kyle Bocinsky. 2019. “FedData: Functions to Automate Downloading Geospatial Data Available from Several Federated Data Sources.” https://cran.r-project.org/package=FedData.
Chamberlain, Scott. 2021. “Spocc: Interface to Species Occurrence Data Sources.” https://cran.r-project.org/package=spocc.
Hollister, J, and Tarak Shah. 2017. “Elevatr: Access Elevation Data from Various APIs.” https://cran.r-project.org/package=elevatr.
Isaac, Nick JB, Marta A Jarzyna, Petr Keil, Lea I Dambly, Philipp H Boersch-Supan, Ella Browning, Stephen N Freeman, et al. 2020. “Data Integration for Large-Scale Models of Species Distributions.” Trends in Ecology & Evolution 35 (1): 56–67. https://doi.org/10.1016/j.tree.2019.08.006.
Martins, Thiago G, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2013. “Bayesian Computing with INLA: New Features.” Computational Statistics & Data Analysis 67: 68–83. https://doi.org/10.1016/j.csda.2013.04.014.
Miller, David AW, Krishna Pacifici, Jamie S Sanderlin, and Brian J Reich. 2019. “The Recent Past and Promising Future for Data Integration Methods to Estimate Species’ Distributions.” Methods in Ecology and Evolution 10 (1): 22–37. https://doi.org/10.1111/2041-210X.13110.
rOpenSci. 2018. USABoundaries.” https://github.com/ropensci/USAboundaries.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series b (Statistical Methodology) 71 (2): 319–92. https://doi.org/10.1111/j.1467-9868.2008.00700.x.
Simmonds, Emily G, Susan G Jarvis, Peter A Henrys, Nick JB Isaac, and Robert B O’Hara. 2020. “Is More Data Always Better? A Simulation Study of Benefits and Limitations of Integrated Distribution Models.” Ecography 43 (10): 1413–22. https://doi.org/10.1111/ecog.05146.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.