An introduction to BNPvegan

Bayesian nonparametric methods for ecology

Introduction

Installation

# If the devtools R package is not already installed
# install.packages("devtools")
devtools::install_github("alessandrozito/BNPvegan")
library(BNPvegan)

Package description

BNPvegan is an R package that implements some fundamental tools of the Bayesian nonparametric literature on Species Sampling models. The code is available at this github repository. The package has been designed for ecological applications. In this spirit, we chose to name it to mimic and extend the existing vegan package (Oksanen et al. 2020). In this tutorial we provide both a mathematical overview of the main methods available in BNPvegan, and we describe in details how to use it and how to interpret its results.

The package is constructed on two building blocks, which are highly interrelated:

  • The Species Sampling model module -ssm- which runs the basic processes in the species sampling model literature (Pitman 1996), namely the Dirichlet process (Ferguson 1973) and the Pitman-Yor process (Perman, Pitman, and Yor 1992).
  • The Species Discovery model module -sdm- which runs the species discovery model introduced in Zito et al. (2020).

Both functions are designed to work with a standard R syntax, as their output can be accessed via the classic methods summary, plot, coef and logLik.

In both cases, the main input of the package is a vector of abundances representing the species counts observed in a sample. The package is endowed with two example datasets:

  1. Lepidoptera, which is the dataset of butterfly counts in Fisher, Corbet, and Williams (1943)
  2. fungalOTU, which contains the counts for some fungal species recorded in a field experiment.

We will use both to show the basic functioning of ssm and sdm. As a preliminary step, however, we show how to obtain some simple sample-based quantities.

Sample-based quantities

We hereby refer to sample-based quantities as those indicators which are not inherited from the the ssm or the sdm method, but can be directly derived from the vector of species abundances without a specific model fit. These methods are the sample coverage, the Gini heterogeneity index and the rarefaction curve.

From a mathematical perspective, let \(X_1, \ldots, X_n\) be a collection of “species” tags, with \(X_i\) being the \(i\)th species observed in the sample. We call abundance the total number of species observed, \(n\), and richness the number distinct species observed in the sample, \(K\). Furthermore, let \(n_1, \ldots, n_K\) the abundances with which each species appears in the sample. The three sample-based methods offer a way to quickly summarize vector of abundances. We describe them in details one by one.

The sample coverage

The sample coverage is the sum of the proportions of species that have been observed in the sample. Is is defined as \[ C_n = \sum_{h \in \mathcal{H}} f_h, \quad \mathcal{H} = \{\text{"Indexes of the species observed among the } n \text{ data"}\} \] where \(\sum_{h=1}^H f_h = 1\) are the species proportions and \(H\) is the total number of species in the population. Clearly, \(K \leq H\), ie. the number of species discovered is less than the true number. The BNPvegan package offers a very simple first “estimator” for \(C_n\), the old-timely Turing estimator (Good 1953): \[ \hat{C}_n = 1 - \frac{m_1}{n}. \] with \(m_1\) being the number of species observed only once. This estimator can be called in R with the function coverage as follows.

# Turing estimator of the sample coverage
abundances <- as.numeric(Lepidoptera)
coverage(abundances)
## [1] 0.9975295

This tells that roughly 99.75% of the species in the population from which Lepidoptera was drawn have been discovered.

The Gini heterogeneity index

The Gini heterogeneity index is a measure to quantify the amount of biodiversity present in the sample. In probabilistic terms, the Gini index is the probability that two species are different when taken at random from the population:

\[ G = 1- \sum_{h = 1}^H f_h^2. \] An estimator for \(G\) is \[\hat{G} = 1 - \frac{1}{n^2}\sum_{j = 1}^K n_j^2,\] and can be obtained by running the Gini command:

# Sample-based Gini index
Gini(abundances)
## [1] 0.9749111

In other words, the probability that two species are equal when drawn at random from the population is 0.975. This result is in line with what found with the coverage method.

The rarefaction method

While often ecologists deal with species frequencies, sometimes the species tags \(X_1, \ldots, X_n\), are observed sequentially. This allows to construct an accumulation curve, which plots the number of distinct species observed, \(K_n\), as a function of the abundance, \(n\). However, in the cases in which the data are not sequential, one can cope for the ordering issue by constructing the so-called rarefaction curve, ie. the average accumulation curve. This quantity is derived from a combinatorial argument and it is equal to \[ \bar{K}_i = K - \binom{n}{i}^{-1} \sum_{j = 1}^K \binom{n - n_j}{i}, \] for \(i = 1, \ldots, n\). To compute it, it is sufficient to call the function rarefaction on a vector of frequencies. Notice that the function can be computationally demanding for large \(n\) and large \(K\). Thus, a verbose option is available if one wants to monitor the process.

# Compute and plot the sample-based rarefaction curve
rar <- rarefaction(abundances, verbose = TRUE)
## Number of samples processed: 1254 [10%] 
## Number of samples processed: 2508 [20%] 
## Number of samples processed: 3762 [30%] 
## Number of samples processed: 5016 [40%] 
## Number of samples processed: 6270 [50%] 
## Number of samples processed: 7524 [60%] 
## Number of samples processed: 8778 [70%] 
## Number of samples processed: 10032 [80%] 
## Number of samples processed: 11286 [90%] 
## Number of samples processed: 12540 [100%]
plot(rar, type = "l", xlab = "Sample size", ylab = "Rarefaction")

Note: the rarefaction method performs the same task as the rarecurve function in the vegan package.

While being relatively quick and easy to interpret, the sample-based quantities are not the product of a model. As such, there is no way to include uncertainty quantification in their estimates. In the next sections, however, we show how the methods ssm and sdm can solve for this issue.

Species Sampling models

Sequential allocation scheme

Species sampling models (Pitman 1996) are a class of generative models for species abundances which belong to the Bayesian nonprametric literature. The most common species sampling models are the Dirichlet process, "DP" (Ferguson 1973) and the Pitman-Yor process, "PY" (Perman, Pitman, and Yor 1992). For the purpose of this guide, we will only provide an overview of the basic properties of both models. For an extensive overview on how species sampling model work, see De Blasi et al. (2015).

Both the Dirichlet and the Pitman-Yor process generate species abundances through a sequential prediction scheme. Let \(X_{1}, \ldots, X_{n}\) be a sequence of individuals observed, \(K_n\) the total number of distinct species observed and \(X_{1}^*, \ldots, X_{K_n}^*\) the distinct species labels. In other words, \(X_{j}^*\) is the \(j\)th species label observed in among the first \(n\) observations. Then, the species of \(X_{n+1}\) is determined by the following prediction rules:

  • In the Dirichlet process: \[X_{n+1} \mid X_1, \ldots, X_n \sim \frac{\alpha}{\alpha + n}(\text{"new species"}) + \sum_{j=1}^{K_n} \frac{n_j}{\alpha + n}\delta_{X_j^*}\] with \(\alpha >0\);
  • In the Pitman-Yor process: \[ X_{n+1} \mid X_1, \ldots, X_n \sim \frac{\alpha + \sigma K_{n}}{\alpha + n}(\text{"new species"}) + \sum_{j=1}^{K_n} \frac{n_j - \sigma}{\alpha + n}\delta_{X_j^*} \] with \(\alpha >-\sigma\) and \(\sigma \in [0, 1)\).

The parameters \(\alpha\) and \(\sigma\) govern the allocation scheme. In particular, it is easy to see that the Pitman-Yor process degenerates to the Dirichlet process when \(\sigma = 0\). The interpretation of both equations is rather simple: the species observed at time \(n+1\) can either be of a new type or of a previously observed one. In particular, the probability of observing an “old” species is increasing in the abundance of that particular species. The package BNPvegan contains a random sampler for both processes, which returns a vector of species abundances.

set.seed(1)
n <- 1000
# Random sample from a Dirichlet process
rDP(n, alpha = 1)  
## [1] 583 305  75   7   4  16   8   1   1
# Random sample from a Pitman-Yor process 
rPY(n, alpha = 1, sigma = 0.3)  
##  [1] 180 141 542  43  10  31   1   4   1   3   8   5   1   2  11   2   1   1   3
## [20]   3   1   2   1   1   1   1

In the following subsection, we show how to perform inference on the parameters \(\alpha\) and \(\sigma\) via the ssm module.

The ssm module

In what follows, we describe the functioning of the ssm module by estimating a Pitman-Yor on the Lepidoptera dataset. All the functions are equivalent for the Dirichlet process.

The likelihood function of the Pitman-Yor process is given by the following quantity (known as exchangeable partition probability function): \[ \mathcal{L}(\alpha, \sigma \mid X_1, \ldots, X_n) = \frac{\prod_{j=1}^{K_n-1}(\alpha + j\sigma)}{(\alpha + 1)_{n-1}}\prod_{j=1}^{K_n}(1-\sigma)_{n_j - 1}. \] To estimate \(\alpha\) and \(\sigma\), the ssm function adopts an empirical Bayes procedure and performs a simple off-theshelf numerical optimization of \(\mathcal{L}(\alpha, \sigma \mid X_1, \ldots, X_n)\) via nlminb. The commands are the following.

# Estimate a Pitman-Yor model
abundances <- as.numeric(Lepidoptera)
fit <- ssm(abundances, model = "PY")  # set model = "DP" for the Dirichlet process
summary(fit)  # Summarize the output
## Model:
##   Pitman-Yor process (PY)
## 
## Quantities:
##   Abundance: 12548
##   Richness: 240
##   Estimated sample coverage: 0.9967
##   Posterior Gini diversity: 0.975
## 
## Extrapolations:
##   Expected species after additional 12548 samples: 269
##   New expected species after additional 12548 samples: 29
## 
## Asymptotics:
##   Expected species at infinity: Infinite
##   Standard deviation at infinity: NA
## 
## Parameters:
##       alpha   sigma      loglik
##   ---------  ------  ----------
##    41.99451       0   -53278.74

The summary function applied to an object of class ssm returns the following quantities:

  1. Abundance: the number of observation in the sample.
  2. Richness: the number of distinct species in the sample
  3. Estimated sample coverage: model-based sample coverage
  4. Posterior Gini diversity: model-based Gini diversity
  5. Expected species after additional xx samples: model-based extrapolation
  6. New expected species after additional xxx samples:: the difference between Expected species after additional xx samples and Richness
  7. Expected species at infinity: aysymptotic species richness under the model
  8. Parameters the empirical Bayes estimates for the parameters of the model.

To access the parameters and the loglikelihood, do

coef(fit)
## [1] 4.199451e+01 1.000000e-16
logLik(fit)
## [1] -53278.74

Notice that in this particular case, \(\sigma\) is estimated to be 0 (ie. the Pitman-Yor degenerates to the Dirichlet process). To assess the goodness-of-fit of the ssm function, we plot the observed frequency-of-abundances against the model-based ones.

# Frequency-of-frequencies plot
plot(fit, type = "freq")

Such a plot can be obtained by looking at the model-based frequency-of-abundances, which are available in closed form for the Pitman-Yor process (Favaro, Lijoi, and Prunster 2013). In particular, for a given \(r = 1, \ldots, n\), the number of species appearing \(r\) times is is equal to

\[ \mathbb{E}(M_r) = \frac{\alpha}{(\alpha)_n}\binom{n}{r}(1-\sigma)_{r-1}(\alpha + \sigma)_{n-r}. \] To obtain it, just use the following method

# Frequency-of-abundances estimate under Pitman-Yor model
freq_abundance(abundances, r = 1:10)
##  1  2  3  4  5  6  7  8  9 10 
## 31  9 13 20 11 13  7  6  2  4
freq_abundance(fit, r = 1:10)
##         1         2         3         4         5         6         7         8 
## 41.857758 20.860721 13.861853 10.362527  8.263018  6.863416  5.863762  5.114075 
##         9        10 
##  4.531032  4.064640

The ssm module offers an alternative to the sample-based estimators described in the previous section. In particular, it is endowed with a model-based coverage, Gini diversity and rarefaction function. Moreover, due to the sequential urn scheme characteristic, it also allows to extrapolate the rarefaction curve via the command extrapolation, ie. predict the additional number of distinct species if additional samples where observed. Let’s see the one by one.

Model-based coverage

In a Pitman-Yor model, the distribution of the posterior sample coverage is \[C_n \mid X_1, \ldots, X_n \sim Beta(n-\sigma K_n, \alpha + \sigma K_n)\] and the posterior mean is \[\mathbb{E}(C_n \mid X_1, \ldots, X_n) = \mathbb{P}(X_{n+1} = \text{old species} \mid X_1, \ldots, X_n) = \frac{n-\sigma K_n}{\alpha + n}\] The advantage of this result is that it allows to obtain uncertainty quantification on the coverage. To see this, just run the following commands:

# Estimate for the mean coverage under a Pitman-Yor model
coverage(fit)
## [1] 0.9966645
# Random samples from the posterior distribution of the coverage
coverage_samples <- rcoverage(fit) 
head(coverage_samples)
## [1] 0.9970554 0.9959044 0.9971507 0.9965390 0.9966237 0.9975949
# Plot the posterior distribution of the coverage
plot(fit, type = "coverage")

Posterior Gini index

The Pitman-Yor process admits a closed for estimator for the Gini index. In particular, the a priori Bayesian estimate for the Gini index is \[\mathbb{E}(G) = \frac{\alpha + \sigma}{\alpha + 1},\] while the a posteriori estimate is \[ \mathbb{E}(G\mid X_1\ldots, X_n) = 1- \frac{1}{(\alpha + n)_2}\Big\{(1 -\sigma)(\alpha + K_n\sigma) + \sum_{j=1}^K(n_j - \sigma)_2\Big\} \] where \((a)_n = \Gamma(a + n)/\Gamma(a)\) is the Pochhammer symbol. To obtain the estimate for the expected posterior Gini index, run the Gini function.

# Expected nodel-based Gini inidex
Gini(fit)
## [1] 0.9750008

Like in the coverage method, we can also obtain posterior samples for the Gini index. This might be computationally slow, as obtaining such samples requires running a stick breaking construction.

# Random samplers for the Gini coefficient
set.seed(1)
Gini_samples <- rGini(fit, n_samples = 500)
hist(Gini_samples, main = "Posterior Gini index")

Model-based rarefaction and extrapolation.

Both the Dirichlet and the Pitman-Yor process assume that the observations \(X_1, \ldots, X_n\) are exchangeable. In other words, the order in which they appear does not matter. Moreover, we can show that the rarefaction curve in the Pitman-Yor process is a Markov process such that \(K_1 = 1\) and

\[ K_{n+1} \mid K_n = K_{n-1} + D_n, \quad (D_n \mid K_n)\sim \text{Bern}\Big(\frac{\alpha + \sigma K_n}{\alpha + n}\Big) \]

Moreover, the associated Bayesian estimate for the rarefaction curve is \[\mathbb{E}(K_n) = \frac{\alpha}{\sigma}\Big\{ \frac{(\alpha + \sigma)_n}{(\alpha)_n} - 1\Big\}\] The numerical values and the associated plot for the rarefaction curve are obtained as follows.

# Model-based rarefaction curve
rar <- rarefaction(fit)
head(rar)
## [1] 1.000000 1.976741 2.931281 3.864606 4.777639 5.671244
tail(rar)
## [1] 239.9833 239.9866 239.9900 239.9933 239.9967 240.0000
# Plot the rarefaction curve
plot(fit, type = "rarefaction", verbose = FALSE, plot_sample = TRUE)

A convenient property of the Pitman-Yor process is the fact that the prediction of additional number of distinct species if \(m\) new samples where observed is available in closed form. In particular, following Favaro et al. (2009), we have that \[ \mathbb{E}(K_m^{(n)} \mid K_n = k) = \Big(k + \frac{\alpha}{\sigma}\Big)\Big\{\frac{(\alpha + n + \sigma)_m}{(\alpha + n)_m} - 1\Big\} \] where \(K_m^{(n)}\) is the number of new species observed under additional \(m\) samples after \(n\) individuals have been observed. The extrapolation function then computes the quantity \(k + \mathbb{E}(K_m^{(n)} \mid K_n = k)\) for any given value of the parameter \(m\). In particular:

# Obtain the extrapolation estimates for m additional samples
extr <- extrapolation(fit, m = c(1:1000))
head(extr)
## [1] 240.0033 240.0067 240.0100 240.0133 240.0167 240.0200
tail(extr)
## [1] 243.1944 243.1975 243.2006 243.2036 243.2067 243.2098
# Plot the extrapolation curve
plot(fit, type = "extrapolation", m = 3000, plot_sample = TRUE, verbose = FALSE)

A final note: in the Pitman-Yor process, the growth rate of the number of clusters \(K_n\) is \(\mathcal{O}(n^\sigma)\), while in the Dirichlet process it is \(\mathcal{O}(\alpha \log n)\). In other words, both models assume an infinite number of species asymptotically. Thus, they cannot be used to assess the sample saturation. Dealing with converging accumulation curves is the purpose of the sdm module.

Species Discovery models

Background

  • Species discovery models focus directly on the discovery of new species. Unlike Species Sampling models, they are not a generative model for a species abundances.
  • Species discovery models are designed to fit and predict any accumulation curve, potentially assessing its saturation level, ie. how close the curve is to convergence.
  • Their formulation involves a direct modelling of the accumulation curve obtained from the sample.

Let \((D_n)_{n \geq 1}\) be a sequence of discovery indicators, such that

\[ D_n = \begin{cases} 1 & \text{if} \quad X_n = ``\text{new"}\\ 0 & \text{if} \quad X_n = ``\text{observed"} \end{cases} \] and trivially \(D_1 = 1\). Then, an accumulation curve \((K_n)_{n \geq 1}\) is defined as \[ K_n = \sum_{i = 1}^{n} D_i, \quad n \geq 1. \]

The goal of a Species Discovery model is threefold:

  1. Rarefy: obtain an in-sample estimator for the number of in-sample discoveries \(\mathbb{E}(K_n)\). This is a model-based rarefaction, as it estimates a functional form for the rarefaction curve.
  2. Extrapolate: predict the expected additional number of new species to be observed if more samples where obtained, \(\mathbb{E}(K_{n+m} \mid K_n = k)\) and \(m \geq 1\).
  3. Saturate: Estimate the sample richness \(\mathbb{E}(K_\infty\mid K_n = k)\) and the sample saturation \(\mathbb{E}(S_n\mid K_n = k)\), with \(S_n = K_n/K_\infty\).

All three goals are obtain by directly modelling the discovery probabilities linked to \(D_1, \ldots, D_n\). In particular, we let \(D_n \stackrel{iid}{\sim} Bernoulli(\pi_n)\), with

\[ \pi_{n+1} = \mathbb{P}(D_{n+1} = 1) = \mathcal{S}(n;\theta), \quad \theta \in \Theta \subset \mathbb{R}^p \] be the discovery probability and \(\mathcal{S}(n;\theta)\) a survival function such that

  • \(S(0; \theta) = 1\) - the first observation is necessarily a new discovery,
  • \(S(n; \theta) > S(n+1; \theta)\) - the discovery probabilities decrease over time,
  • \(\lim_{n\to \infty}S(n;\theta) = 0\) - eventually, all species in the sample are discovered.

Then, any functional form that supports the three assumptions can work as a species discovery model. The package BNPvegan supports two types of survival functions:

  • The three-parameter log-logistic model, LL3 (the default): \[ \mathbb{P}(D_{n+1} = 1) = \frac{\alpha\phi^n}{\alpha\phi^n + n^{1-\sigma} } \] with \(\alpha >0\), \(\sigma <1\) and \(\phi \in (0,1)\). Notice that for \(\phi = 1\) and \(\sigma = 0\), LL3 has the same discovery probability of the Dirichlet process .

  • The Weibull model: \[ \mathbb{P}(D_{n+1} = 1) = \phi^{n^\lambda} \] with \(\phi \in (0,1)\), \(\lambda > 0\).

Under both model, the value for \(K_\infty = \lim_{n\to \infty} K_n\) is guaranteed to converge. In other words, the species discovery model will always ensure a finite saturation.

The sdm module

The sdm module runs the model described above on the sample-based rarefaction curve of the specified vector of frequencies. The available choice for model are model = "LL3" (default) and model = "Weibull". Both models are estimated via empirical Bayes by maximizing the log-likelihood associated to the collection of independent Bernoulli random variables. This quantity is equal to \[ \mathcal{L}(\theta\mid D_1, \ldots, D_n) = \prod_{i = 2}^n \mathcal{S}(i-1;\theta)^{D_n}(1-\mathcal{S}(i-1;\theta))^{1-D_n} \] where the degenerate case \(i= 1\) is trivially excluded. As already mentioned above, it is important to notice that often the sequence of species discovery is not available. In the sample based-methods, this problem is solved by computing the rarefaction curve from the vector of given frequencies, while in the ssm module the exchangeability assumption implies that the ordering is irrelevant. In the sdm case however, the order of the discoveries does matter. To solve this issue, the module maximizes the following likelihood: \[ \mathcal{L}(\theta\mid \tilde{D}_1, \ldots, \tilde{D}_n) = \prod_{i = 2}^n \mathcal{S}(i-1;\theta)^{\tilde{D}_n}(1-\mathcal{S}(i-1;\theta))^{1-\tilde{D}_n} \] where \(\tilde{D}_i = \bar{K}_{i+1}- \bar{K}_i\) and \(\bar{K}_i\) is derived from the average sample-based rarefaction curve. Thus, sdm works in three steps

  1. Compute the sample-based rarefaction curve \(\hat{K}_i\) for \(i = 1, \ldots, n\)
  2. Compute the differences and \(\tilde{D}_i = \hat{K}_{i+1}- \hat{K}_i\) with \(\tilde{D}_i = 1\)
  3. Obtain the empirical Bayes estimates for \(\theta\) via constrained maximization of the likelihood \(\mathcal{L}(\theta\mid \tilde{D}_1, \ldots, \tilde{D}_n)\).

In the case of model = "LL3, the maximization of the likelihood can be obtained with a simple constrained logistic regression (package glmnet). This is makes the sdm function particularly efficient

# Fit the species discovery model. 
# abundances <- fungalOTU
abundances <- Lepidoptera

# Note: verbose controls the state of the construction of the sample-based rarefaction, which is 
#       the most demanding part
fit_sdm <- sdm(abundances, model = "LL3", verbose = FALSE)

# Summarise the output
summary(fit_sdm)
## Model:
##   Three-parameter log-logistic (LL3)
## 
## Quantities:
##   Abundance: 12548
##   Richness: 240
##   Estimated sample coverage: 0.9981
## 
## Extrapolations:
##   Expected species after 12548 additional samples: 252
##   Expected new species after 12548 additional samples: 12
## 
## Asymptotics:
##   Expected species at infinity: 257
##   Standard deviation at infinity: 16.04
##   Expected new species to discover: 17
##   Estimated saturation: 0.9339
## 
## Parameters:
##       alpha       sigma         phi      logLik
##   ---------  ----------  ----------  ----------
##    35.03537   0.0500194   0.9999335   -868.7822

The quantities reported in the summary are:

  1. Abundance: the number of observation in the sample.
  2. Richness: the number of distinct species in the sample
  3. Estimated sample coverage: model-based sample coverage
  4. Expected species after xx additional samples: model-based rarefaction
  5. Expected new species after xx additional samples: difference between Expected species after xx additional samples and Richness
  6. Expected species at infinity: the asymptotic posterior species richness, ie. \(\mathbb{E}(K_\infty\mid K_n = k)\)
  7. Standard deviation at infinity: the asymptotic standard deviation ie. \(\{Var(K_\infty\mid K_n = k)\}^{1/2}\)
  8. Expected new species to discover: the difference between Expected species at infinity and Richness
  9. Sample saturation: the ratio between Richness and Expected species at infinity.
  10. Parameters the empirical Bayes estimates for the parameters of the model.

The parameters are easily extracted with the following commands

# Parameters
coef(fit_sdm)
##       alpha       sigma         phi 
## 35.03537487  0.05001943  0.99993348
# Loglikelihood
logLik(fit_sdm)
## [1] -868.7822

A rarefaction and an extrapolation estimator

The main advantage of the sdm framework is that it is naturally endowed with a rarefaction and an extrapolation estimator. Indeed, we can prove that the number of distinct species \(K_n\) is distributed as a Poisson-binomial distribution, \[K_n \sim PB\{1,\mathcal{S}(1, \theta), \ldots,\mathcal{S}(n-1, \theta)\},\] with mean \[\mathbb{E}(K_n) = \sum_{i =1}^n \mathcal{S}(i-1, \theta).\] The above quantity can be obtained by using the rarefaction method.

# Extract the model-based rarefaction curve for sdm
rar_sdm <- rarefaction(fit_sdm)
head(rar_sdm)
## [1] 1.000000 1.972248 2.919983 3.844996 4.748706 5.632318
tail(rar_sdm)
## [1] 239.9903 239.9922 239.9942 239.9961 239.9981 240.0000
# Plot the model-based and the sample-based rarefaction
plot(fit_sdm, type ="rarefaction") 

As for extrapolation, it can be easily proved that for any \(m\geq 1\),

\[ \mathbb{E}(K_{n+m}\mid K_n = k) = k + \sum_{j = 0}^{n-m-1} \mathcal{S}(j, \theta). \]

Again, the method works in the same way as in the ssm module

# Extrapolate in sdm
extr_sdm <- extrapolation(fit_sdm, m = 1:1000)
head(extr_sdm)
## [1] 240.0019 240.0039 240.0058 240.0078 240.0097 240.0116
tail(extr_sdm)
## [1] 241.8010 241.8027 241.8044 241.8061 241.8077 241.8094
# Plot in sdm
plot(fit_sdm, type = "extrapolation", m = 10000)

Saturation and asymptotic richness methods

The final feature of sdm is the fact that it provides an estimate for the asymptotic species richness, which is the total number of distinct species present in the population. The estimator is equal to \[\mathbb{E}(K_\infty \mid K_n = k) = k + \sum_{j = n}^\infty S(j; \theta) <\infty.\] Moreover, \(K_\inf\mid K_n = k\) is a random variable. To simulate from it, run

### Simulate from the posterior distribution of the species richness
set.seed(1)
Kinf <- sample_Kinf(fit_sdm, size = 1000)
plot(density(Kinf), main = "Posterior species richness")

By sampling from the posterior species richness we can also assess the saturation level, which is the fraction of species discovered over the asymptotic estimate, namely \[S_n\mid K_n = k = \frac{k}{K_\infty}.\] As computing the expected value for this quantity can be demanding, sdm gives the possibility to compute \(k/\mathbb{E}(K_\infty \mid K_n = k)\), which is a first approximation. This can be achieved my specifying method = "approximate" (the default). Otherwise, if one wants to carry uncertainty quantification on the saturation level, set method = "montecarlo" and specify the number of samples n_samples to draw. This last option can be slow. See the following code:

# Asymptotic richness estimators
asym_richness(fit_sdm)
## Exp. species at Inf           sd at Inf 
##           257.00000            16.03976
# Approximate saturation
saturation(fit_sdm, method = "approximate")
## [1] 0.9338521
# Samples from the posterior saturation
set.seed(1)
post_sat <- saturation(fit_sdm, method = "montecarlo", n_samples = 500)
plot(density(post_sat), main = "Posterior saturation")

Finally, we can compute the approximate number of additional samples to draw to reach a desired level of saturation. Can be obtained by specifying method = "target" and a desired target. Note that target has to be between 0 and 0.999 for stability reasons.

# Additional samples to reach a desired level of saturation
saturation(fit_sdm, method = "target", target = 0.9)
## The approximate saturation is already 0.9339 
##  Please specify a higher target value
saturation(fit_sdm, method = "target", target = 0.99)
## [1] 18912

Acknowledgements

The build up of this package and the research have been funded by the LIFEPLAN project (European Union’s Horizon 2020 research and innovation program - grant agreement No 856506).

References

De Blasi, P., S. Favaro, A. Lijoi, R. H. Mena, I. Prunster, and M. Ruggiero. 2015. “Are Gibbs-Type Priors the Most Natural Generalization of the Dirichlet Process?” IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2): 212–29.
Favaro, S., A. Lijoi, R. H. Mena, and I. Prunster. 2009. “Bayesian Non-Parametric Inference for Species Variety with a Two-Parameter Poisson–Dirichlet Process Prior.” Journal of the Royal Statistical Society, Series B 71: 993–1008.
Favaro, S., A. Lijoi, and I. Prunster. 2013. “Conditional Formulae for Gibbs-Type Exchangeable Random Partitions.” Annals of Applied Probability 23: 1721–54.
Ferguson, T. S. 1973. “A Bayesian Analysis of Some Nonparametric Problems.” Annals of Statistics 1: 209–30.
Fisher, R. A., A. S. Corbet, and C. B. Williams. 1943. “The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population.” Journal of Animal Ecology 12 (1): 42–58.
Good, I. J. 1953. “The Population Frequencies of Species and the Estimation of Populations Parameters.” Biometrika 40: 237–64.
Oksanen, J., F. G. Blanchet, M. Friendly, R. Kindt, P. Legendre, D. McGlinn, P. R. Minchin, et al. 2020. “Vegan: Community Ecology Package.” 2020.
Perman, M., J. Pitman, and M. Yor. 1992. “Size-Biased Sampling of Poisson Point Processes and Excursions.” Probability Theory and Related Fields 92 (1): 21–39.
Pitman, J. 1996. “Some Developments of the Blackwell-Macqueen Urn Scheme.” In Statistics, Probability and Game Theory. Papers in Honor of David Blackwell, edited by T. S. Ferguson, L. S. Shapley, and James B. MacQueen, 30:245–67. IMS Lecture notes, Monograph Series. Hayward: Institute of Mathematical Statistics.
Zito, A., T. Rigon, O. Ovaskainen, and D. B. Dunson. 2020. “Bayesian Nonparametric Modelling of Sequential Discoveries.” Https://Arxiv.org/Abs/2011.06629s.

  1. Duke University, ↩︎

  2. University of Milano-Bicocca, ↩︎