An introduction to BNPvegan
Bayesian nonparametric methods for ecology
Introduction
Installation
# If the devtools R package is not already installed
# install.packages("devtools")
::install_github("alessandrozito/BNPvegan")
devtoolslibrary(BNPvegan)
Package description
BNPvegan
is an R package that implements some fundamental tools of the Bayesian nonparametric literature on Species Sampling models. The code is available at this github repository. The package has been designed for ecological applications. In this spirit, we chose to name it to mimic and extend the existing vegan
package (Oksanen et al. 2020). In this tutorial we provide both a mathematical overview of the main methods available in BNPvegan
, and we describe in details how to use it and how to interpret its results.
The package is constructed on two building blocks, which are highly interrelated:
- The Species Sampling model module -
ssm
- which runs the basic processes in the species sampling model literature (Pitman 1996), namely the Dirichlet process (Ferguson 1973) and the Pitman-Yor process (Perman, Pitman, and Yor 1992).
- The Species Discovery model module -
sdm
- which runs the species discovery model introduced in Zito et al. (2020).
Both functions are designed to work with a standard R
syntax, as their output can be accessed via the classic methods summary
, plot
, coef
and logLik
.
In both cases, the main input of the package is a vector of abundances
representing the species counts observed in a sample. The package is endowed with two example datasets:
Lepidoptera
, which is the dataset of butterfly counts in Fisher, Corbet, and Williams (1943)fungalOTU
, which contains the counts for some fungal species recorded in a field experiment.
We will use both to show the basic functioning of ssm
and sdm
. As a preliminary step, however, we show how to obtain some simple sample-based quantities.
Sample-based quantities
We hereby refer to sample-based quantities as those indicators which are not inherited from the the ssm
or the sdm
method, but can be directly derived from the vector of species abundances
without a specific model fit. These methods are the sample coverage
, the Gini
heterogeneity index and the rarefaction
curve.
From a mathematical perspective, let \(X_1, \ldots, X_n\) be a collection of “species” tags, with \(X_i\) being the \(i\)th species observed in the sample. We call abundance the total number of species observed, \(n\), and richness the number distinct species observed in the sample, \(K\). Furthermore, let \(n_1, \ldots, n_K\) the abundances with which each species appears in the sample. The three sample-based methods offer a way to quickly summarize vector of abundances
. We describe them in details one by one.
The sample coverage
The sample coverage is the sum of the proportions of species that have been observed in the sample. Is is defined as \[
C_n = \sum_{h \in \mathcal{H}} f_h, \quad \mathcal{H} = \{\text{"Indexes of the species observed among the } n \text{ data"}\}
\] where \(\sum_{h=1}^H f_h = 1\) are the species proportions and \(H\) is the total number of species in the population. Clearly, \(K \leq H\), ie. the number of species discovered is less than the true number. The BNPvegan
package offers a very simple first “estimator” for \(C_n\), the old-timely Turing estimator (Good 1953): \[
\hat{C}_n = 1 - \frac{m_1}{n}.
\] with \(m_1\) being the number of species observed only once. This estimator can be called in R
with the function coverage
as follows.
# Turing estimator of the sample coverage
<- as.numeric(Lepidoptera)
abundances coverage(abundances)
## [1] 0.9975295
This tells that roughly 99.75% of the species in the population from which Lepidoptera
was drawn have been discovered.
The Gini
heterogeneity index
The Gini heterogeneity index is a measure to quantify the amount of biodiversity present in the sample. In probabilistic terms, the Gini index is the probability that two species are different when taken at random from the population:
\[
G = 1- \sum_{h = 1}^H f_h^2.
\] An estimator for \(G\) is \[\hat{G} = 1 - \frac{1}{n^2}\sum_{j = 1}^K n_j^2,\] and can be obtained by running the Gini
command:
# Sample-based Gini index
Gini(abundances)
## [1] 0.9749111
In other words, the probability that two species are equal when drawn at random from the population is 0.975. This result is in line with what found with the coverage
method.
The rarefaction
method
While often ecologists deal with species frequencies, sometimes the species tags \(X_1, \ldots, X_n\), are observed sequentially. This allows to construct an accumulation curve, which plots the number of distinct species observed, \(K_n\), as a function of the abundance, \(n\). However, in the cases in which the data are not sequential, one can cope for the ordering issue by constructing the so-called rarefaction curve, ie. the average accumulation curve. This quantity is derived from a combinatorial argument and it is equal to \[
\bar{K}_i = K - \binom{n}{i}^{-1} \sum_{j = 1}^K \binom{n - n_j}{i},
\] for \(i = 1, \ldots, n\). To compute it, it is sufficient to call the function rarefaction
on a vector of frequencies
. Notice that the function can be computationally demanding for large \(n\) and large \(K\). Thus, a verbose
option is available if one wants to monitor the process.
# Compute and plot the sample-based rarefaction curve
<- rarefaction(abundances, verbose = TRUE) rar
## Number of samples processed: 1254 [10%]
## Number of samples processed: 2508 [20%]
## Number of samples processed: 3762 [30%]
## Number of samples processed: 5016 [40%]
## Number of samples processed: 6270 [50%]
## Number of samples processed: 7524 [60%]
## Number of samples processed: 8778 [70%]
## Number of samples processed: 10032 [80%]
## Number of samples processed: 11286 [90%]
## Number of samples processed: 12540 [100%]
plot(rar, type = "l", xlab = "Sample size", ylab = "Rarefaction")
Note: the rarefaction
method performs the same task as the rarecurve
function in the vegan
package.
While being relatively quick and easy to interpret, the sample-based quantities are not the product of a model. As such, there is no way to include uncertainty quantification in their estimates. In the next sections, however, we show how the methods ssm
and sdm
can solve for this issue.
Species Sampling models
Sequential allocation scheme
Species sampling models (Pitman 1996) are a class of generative models for species abundances which belong to the Bayesian nonprametric literature. The most common species sampling models are the Dirichlet process, "DP"
(Ferguson 1973) and the Pitman-Yor process, "PY"
(Perman, Pitman, and Yor 1992). For the purpose of this guide, we will only provide an overview of the basic properties of both models. For an extensive overview on how species sampling model work, see De Blasi et al. (2015).
Both the Dirichlet and the Pitman-Yor process generate species abundances through a sequential prediction scheme. Let \(X_{1}, \ldots, X_{n}\) be a sequence of individuals observed, \(K_n\) the total number of distinct species observed and \(X_{1}^*, \ldots, X_{K_n}^*\) the distinct species labels. In other words, \(X_{j}^*\) is the \(j\)th species label observed in among the first \(n\) observations. Then, the species of \(X_{n+1}\) is determined by the following prediction rules:
- In the Dirichlet process: \[X_{n+1} \mid X_1, \ldots, X_n \sim \frac{\alpha}{\alpha + n}(\text{"new species"}) + \sum_{j=1}^{K_n} \frac{n_j}{\alpha + n}\delta_{X_j^*}\] with \(\alpha >0\);
- In the Pitman-Yor process: \[ X_{n+1} \mid X_1, \ldots, X_n \sim \frac{\alpha + \sigma K_{n}}{\alpha + n}(\text{"new species"}) + \sum_{j=1}^{K_n} \frac{n_j - \sigma}{\alpha + n}\delta_{X_j^*} \] with \(\alpha >-\sigma\) and \(\sigma \in [0, 1)\).
The parameters \(\alpha\) and \(\sigma\) govern the allocation scheme. In particular, it is easy to see that the Pitman-Yor process degenerates to the Dirichlet process when \(\sigma = 0\). The interpretation of both equations is rather simple: the species observed at time \(n+1\) can either be of a new type or of a previously observed one. In particular, the probability of observing an “old” species is increasing in the abundance of that particular species. The package BNPvegan
contains a random sampler for both processes, which returns a vector of species abundances.
set.seed(1)
<- 1000
n # Random sample from a Dirichlet process
rDP(n, alpha = 1)
## [1] 583 305 75 7 4 16 8 1 1
# Random sample from a Pitman-Yor process
rPY(n, alpha = 1, sigma = 0.3)
## [1] 180 141 542 43 10 31 1 4 1 3 8 5 1 2 11 2 1 1 3
## [20] 3 1 2 1 1 1 1
In the following subsection, we show how to perform inference on the parameters \(\alpha\) and \(\sigma\) via the ssm
module.
The ssm
module
In what follows, we describe the functioning of the ssm
module by estimating a Pitman-Yor on the Lepidoptera
dataset. All the functions are equivalent for the Dirichlet process.
The likelihood function of the Pitman-Yor process is given by the following quantity (known as exchangeable partition probability function): \[
\mathcal{L}(\alpha, \sigma \mid X_1, \ldots, X_n) = \frac{\prod_{j=1}^{K_n-1}(\alpha + j\sigma)}{(\alpha + 1)_{n-1}}\prod_{j=1}^{K_n}(1-\sigma)_{n_j - 1}.
\] To estimate \(\alpha\) and \(\sigma\), the ssm
function adopts an empirical Bayes procedure and performs a simple off-theshelf numerical optimization of \(\mathcal{L}(\alpha, \sigma \mid X_1, \ldots, X_n)\) via nlminb
. The commands are the following.
# Estimate a Pitman-Yor model
<- as.numeric(Lepidoptera)
abundances <- ssm(abundances, model = "PY") # set model = "DP" for the Dirichlet process
fit summary(fit) # Summarize the output
## Model:
## Pitman-Yor process (PY)
##
## Quantities:
## Abundance: 12548
## Richness: 240
## Estimated sample coverage: 0.9967
## Posterior Gini diversity: 0.975
##
## Extrapolations:
## Expected species after additional 12548 samples: 269
## New expected species after additional 12548 samples: 29
##
## Asymptotics:
## Expected species at infinity: Infinite
## Standard deviation at infinity: NA
##
## Parameters:
## alpha sigma loglik
## --------- ------ ----------
## 41.99451 0 -53278.74
The summary
function applied to an object of class ssm
returns the following quantities:
Abundance
: the number of observation in the sample.Richness
: the number of distinct species in the sampleEstimated sample coverage
: model-based sample coveragePosterior Gini diversity
: model-based Gini diversityExpected species after additional xx samples
: model-based extrapolationNew expected species after additional xxx samples:
: the difference betweenExpected species after additional xx samples
andRichness
Expected species at infinity
: aysymptotic species richness under the modelParameters
the empirical Bayes estimates for the parameters of the model.
To access the parameters and the loglikelihood, do
coef(fit)
## [1] 4.199451e+01 1.000000e-16
logLik(fit)
## [1] -53278.74
Notice that in this particular case, \(\sigma\) is estimated to be 0 (ie. the Pitman-Yor degenerates to the Dirichlet process). To assess the goodness-of-fit of the ssm
function, we plot the observed frequency-of-abundances against the model-based ones.
# Frequency-of-frequencies plot
plot(fit, type = "freq")
Such a plot can be obtained by looking at the model-based frequency-of-abundances, which are available in closed form for the Pitman-Yor process (Favaro, Lijoi, and Prunster 2013). In particular, for a given \(r = 1, \ldots, n\), the number of species appearing \(r\) times is is equal to
\[ \mathbb{E}(M_r) = \frac{\alpha}{(\alpha)_n}\binom{n}{r}(1-\sigma)_{r-1}(\alpha + \sigma)_{n-r}. \] To obtain it, just use the following method
# Frequency-of-abundances estimate under Pitman-Yor model
freq_abundance(abundances, r = 1:10)
## 1 2 3 4 5 6 7 8 9 10
## 31 9 13 20 11 13 7 6 2 4
freq_abundance(fit, r = 1:10)
## 1 2 3 4 5 6 7 8
## 41.857758 20.860721 13.861853 10.362527 8.263018 6.863416 5.863762 5.114075
## 9 10
## 4.531032 4.064640
The ssm
module offers an alternative to the sample-based estimators described in the previous section. In particular, it is endowed with a model-based coverage
, Gini
diversity and rarefaction
function. Moreover, due to the sequential urn scheme characteristic, it also allows to extrapolate the rarefaction curve via the command extrapolation
, ie. predict the additional number of distinct species if additional samples where observed. Let’s see the one by one.
Model-based coverage
In a Pitman-Yor model, the distribution of the posterior sample coverage is \[C_n \mid X_1, \ldots, X_n \sim Beta(n-\sigma K_n, \alpha + \sigma K_n)\] and the posterior mean is \[\mathbb{E}(C_n \mid X_1, \ldots, X_n) = \mathbb{P}(X_{n+1} = \text{old species} \mid X_1, \ldots, X_n) = \frac{n-\sigma K_n}{\alpha + n}\] The advantage of this result is that it allows to obtain uncertainty quantification on the coverage. To see this, just run the following commands:
# Estimate for the mean coverage under a Pitman-Yor model
coverage(fit)
## [1] 0.9966645
# Random samples from the posterior distribution of the coverage
<- rcoverage(fit)
coverage_samples head(coverage_samples)
## [1] 0.9970554 0.9959044 0.9971507 0.9965390 0.9966237 0.9975949
# Plot the posterior distribution of the coverage
plot(fit, type = "coverage")
Posterior Gini
index
The Pitman-Yor process admits a closed for estimator for the Gini index. In particular, the a priori Bayesian estimate for the Gini index is \[\mathbb{E}(G) = \frac{\alpha + \sigma}{\alpha + 1},\] while the a posteriori estimate is \[
\mathbb{E}(G\mid X_1\ldots, X_n) = 1- \frac{1}{(\alpha + n)_2}\Big\{(1 -\sigma)(\alpha + K_n\sigma) + \sum_{j=1}^K(n_j - \sigma)_2\Big\}
\] where \((a)_n = \Gamma(a + n)/\Gamma(a)\) is the Pochhammer symbol. To obtain the estimate for the expected posterior Gini index, run the Gini
function.
# Expected nodel-based Gini inidex
Gini(fit)
## [1] 0.9750008
Like in the coverage
method, we can also obtain posterior samples for the Gini index. This might be computationally slow, as obtaining such samples requires running a stick breaking construction.
# Random samplers for the Gini coefficient
set.seed(1)
<- rGini(fit, n_samples = 500)
Gini_samples hist(Gini_samples, main = "Posterior Gini index")
Model-based rarefaction
and extrapolation
.
Both the Dirichlet and the Pitman-Yor process assume that the observations \(X_1, \ldots, X_n\) are exchangeable. In other words, the order in which they appear does not matter. Moreover, we can show that the rarefaction curve in the Pitman-Yor process is a Markov process such that \(K_1 = 1\) and
\[ K_{n+1} \mid K_n = K_{n-1} + D_n, \quad (D_n \mid K_n)\sim \text{Bern}\Big(\frac{\alpha + \sigma K_n}{\alpha + n}\Big) \]
Moreover, the associated Bayesian estimate for the rarefaction curve is \[\mathbb{E}(K_n) = \frac{\alpha}{\sigma}\Big\{ \frac{(\alpha + \sigma)_n}{(\alpha)_n} - 1\Big\}\] The numerical values and the associated plot for the rarefaction curve are obtained as follows.
# Model-based rarefaction curve
<- rarefaction(fit)
rar head(rar)
## [1] 1.000000 1.976741 2.931281 3.864606 4.777639 5.671244
tail(rar)
## [1] 239.9833 239.9866 239.9900 239.9933 239.9967 240.0000
# Plot the rarefaction curve
plot(fit, type = "rarefaction", verbose = FALSE, plot_sample = TRUE)
A convenient property of the Pitman-Yor process is the fact that the prediction of additional number of distinct species if \(m\) new samples where observed is available in closed form. In particular, following Favaro et al. (2009), we have that \[
\mathbb{E}(K_m^{(n)} \mid K_n = k) = \Big(k + \frac{\alpha}{\sigma}\Big)\Big\{\frac{(\alpha + n + \sigma)_m}{(\alpha + n)_m} - 1\Big\}
\] where \(K_m^{(n)}\) is the number of new species observed under additional \(m\) samples after \(n\) individuals have been observed. The extrapolation
function then computes the quantity \(k + \mathbb{E}(K_m^{(n)} \mid K_n = k)\) for any given value of the parameter \(m\). In particular:
# Obtain the extrapolation estimates for m additional samples
<- extrapolation(fit, m = c(1:1000))
extr head(extr)
## [1] 240.0033 240.0067 240.0100 240.0133 240.0167 240.0200
tail(extr)
## [1] 243.1944 243.1975 243.2006 243.2036 243.2067 243.2098
# Plot the extrapolation curve
plot(fit, type = "extrapolation", m = 3000, plot_sample = TRUE, verbose = FALSE)
A final note: in the Pitman-Yor process, the growth rate of the number of clusters \(K_n\) is \(\mathcal{O}(n^\sigma)\), while in the Dirichlet process it is \(\mathcal{O}(\alpha \log n)\). In other words, both models assume an infinite number of species asymptotically. Thus, they cannot be used to assess the sample saturation. Dealing with converging accumulation curves is the purpose of the sdm
module.
Species Discovery models
Background
- Species discovery models focus directly on the discovery of new species. Unlike Species Sampling models, they are not a generative model for a species abundances.
- Species discovery models are designed to fit and predict any accumulation curve, potentially assessing its saturation level, ie. how close the curve is to convergence.
- Their formulation involves a direct modelling of the accumulation curve obtained from the sample.
Let \((D_n)_{n \geq 1}\) be a sequence of discovery indicators, such that
\[ D_n = \begin{cases} 1 & \text{if} \quad X_n = ``\text{new"}\\ 0 & \text{if} \quad X_n = ``\text{observed"} \end{cases} \] and trivially \(D_1 = 1\). Then, an accumulation curve \((K_n)_{n \geq 1}\) is defined as \[ K_n = \sum_{i = 1}^{n} D_i, \quad n \geq 1. \]
The goal of a Species Discovery model is threefold:
- Rarefy: obtain an in-sample estimator for the number of in-sample discoveries \(\mathbb{E}(K_n)\). This is a model-based rarefaction, as it estimates a functional form for the rarefaction curve.
- Extrapolate: predict the expected additional number of new species to be observed if more samples where obtained, \(\mathbb{E}(K_{n+m} \mid K_n = k)\) and \(m \geq 1\).
- Saturate: Estimate the sample richness \(\mathbb{E}(K_\infty\mid K_n = k)\) and the sample saturation \(\mathbb{E}(S_n\mid K_n = k)\), with \(S_n = K_n/K_\infty\).
All three goals are obtain by directly modelling the discovery probabilities linked to \(D_1, \ldots, D_n\). In particular, we let \(D_n \stackrel{iid}{\sim} Bernoulli(\pi_n)\), with
\[ \pi_{n+1} = \mathbb{P}(D_{n+1} = 1) = \mathcal{S}(n;\theta), \quad \theta \in \Theta \subset \mathbb{R}^p \] be the discovery probability and \(\mathcal{S}(n;\theta)\) a survival function such that
- \(S(0; \theta) = 1\) - the first observation is necessarily a new discovery,
- \(S(n; \theta) > S(n+1; \theta)\) - the discovery probabilities decrease over time,
- \(\lim_{n\to \infty}S(n;\theta) = 0\) - eventually, all species in the sample are discovered.
Then, any functional form that supports the three assumptions can work as a species discovery model. The package BNPvegan
supports two types of survival functions:
The three-parameter log-logistic model,
LL3
(the default): \[ \mathbb{P}(D_{n+1} = 1) = \frac{\alpha\phi^n}{\alpha\phi^n + n^{1-\sigma} } \] with \(\alpha >0\), \(\sigma <1\) and \(\phi \in (0,1)\). Notice that for \(\phi = 1\) and \(\sigma = 0\),LL3
has the same discovery probability of the Dirichlet process .The
Weibull
model: \[ \mathbb{P}(D_{n+1} = 1) = \phi^{n^\lambda} \] with \(\phi \in (0,1)\), \(\lambda > 0\).
Under both model, the value for \(K_\infty = \lim_{n\to \infty} K_n\) is guaranteed to converge. In other words, the species discovery model will always ensure a finite saturation.
The sdm
module
The sdm
module runs the model described above on the sample-based rarefaction curve of the specified vector of frequencies
. The available choice for model
are model = "LL3"
(default) and model = "Weibull"
. Both models are estimated via empirical Bayes by maximizing the log-likelihood associated to the collection of independent Bernoulli random variables. This quantity is equal to \[
\mathcal{L}(\theta\mid D_1, \ldots, D_n) = \prod_{i = 2}^n \mathcal{S}(i-1;\theta)^{D_n}(1-\mathcal{S}(i-1;\theta))^{1-D_n}
\] where the degenerate case \(i= 1\) is trivially excluded. As already mentioned above, it is important to notice that often the sequence of species discovery is not available. In the sample based-methods, this problem is solved by computing the rarefaction curve from the vector of given frequencies
, while in the ssm
module the exchangeability assumption implies that the ordering is irrelevant. In the sdm
case however, the order of the discoveries does matter. To solve this issue, the module maximizes the following likelihood: \[
\mathcal{L}(\theta\mid \tilde{D}_1, \ldots, \tilde{D}_n) = \prod_{i = 2}^n \mathcal{S}(i-1;\theta)^{\tilde{D}_n}(1-\mathcal{S}(i-1;\theta))^{1-\tilde{D}_n}
\] where \(\tilde{D}_i = \bar{K}_{i+1}- \bar{K}_i\) and \(\bar{K}_i\) is derived from the average sample-based rarefaction curve. Thus, sdm
works in three steps
- Compute the sample-based rarefaction curve \(\hat{K}_i\) for \(i = 1, \ldots, n\)
- Compute the differences and \(\tilde{D}_i = \hat{K}_{i+1}- \hat{K}_i\) with \(\tilde{D}_i = 1\)
- Obtain the empirical Bayes estimates for \(\theta\) via constrained maximization of the likelihood \(\mathcal{L}(\theta\mid \tilde{D}_1, \ldots, \tilde{D}_n)\).
In the case of model = "LL3
, the maximization of the likelihood can be obtained with a simple constrained logistic regression (package glmnet
). This is makes the sdm
function particularly efficient
# Fit the species discovery model.
# abundances <- fungalOTU
<- Lepidoptera
abundances
# Note: verbose controls the state of the construction of the sample-based rarefaction, which is
# the most demanding part
<- sdm(abundances, model = "LL3", verbose = FALSE)
fit_sdm
# Summarise the output
summary(fit_sdm)
## Model:
## Three-parameter log-logistic (LL3)
##
## Quantities:
## Abundance: 12548
## Richness: 240
## Estimated sample coverage: 0.9981
##
## Extrapolations:
## Expected species after 12548 additional samples: 252
## Expected new species after 12548 additional samples: 12
##
## Asymptotics:
## Expected species at infinity: 257
## Standard deviation at infinity: 16.04
## Expected new species to discover: 17
## Estimated saturation: 0.9339
##
## Parameters:
## alpha sigma phi logLik
## --------- ---------- ---------- ----------
## 35.03537 0.0500194 0.9999335 -868.7822
The quantities reported in the summary are:
Abundance
: the number of observation in the sample.Richness
: the number of distinct species in the sampleEstimated sample coverage
: model-based sample coverageExpected species after xx additional samples
: model-based rarefactionExpected new species after xx additional samples
: difference betweenExpected species after xx additional samples
andRichness
Expected species at infinity
: the asymptotic posterior species richness, ie. \(\mathbb{E}(K_\infty\mid K_n = k)\)Standard deviation at infinity
: the asymptotic standard deviation ie. \(\{Var(K_\infty\mid K_n = k)\}^{1/2}\)Expected new species to discover
: the difference betweenExpected species at infinity
andRichness
Sample saturation
: the ratio betweenRichness
andExpected species at infinity
.Parameters
the empirical Bayes estimates for the parameters of the model.
The parameters are easily extracted with the following commands
# Parameters
coef(fit_sdm)
## alpha sigma phi
## 35.03537487 0.05001943 0.99993348
# Loglikelihood
logLik(fit_sdm)
## [1] -868.7822
A rarefaction
and an extrapolation
estimator
The main advantage of the sdm
framework is that it is naturally endowed with a rarefaction
and an extrapolation
estimator. Indeed, we can prove that the number of distinct species \(K_n\) is distributed as a Poisson-binomial distribution, \[K_n \sim PB\{1,\mathcal{S}(1, \theta), \ldots,\mathcal{S}(n-1, \theta)\},\] with mean \[\mathbb{E}(K_n) = \sum_{i =1}^n \mathcal{S}(i-1, \theta).\] The above quantity can be obtained by using the rarefaction
method.
# Extract the model-based rarefaction curve for sdm
<- rarefaction(fit_sdm)
rar_sdm head(rar_sdm)
## [1] 1.000000 1.972248 2.919983 3.844996 4.748706 5.632318
tail(rar_sdm)
## [1] 239.9903 239.9922 239.9942 239.9961 239.9981 240.0000
# Plot the model-based and the sample-based rarefaction
plot(fit_sdm, type ="rarefaction")
As for extrapolation, it can be easily proved that for any \(m\geq 1\),
\[ \mathbb{E}(K_{n+m}\mid K_n = k) = k + \sum_{j = 0}^{n-m-1} \mathcal{S}(j, \theta). \]
Again, the method works in the same way as in the ssm
module
# Extrapolate in sdm
<- extrapolation(fit_sdm, m = 1:1000)
extr_sdm head(extr_sdm)
## [1] 240.0019 240.0039 240.0058 240.0078 240.0097 240.0116
tail(extr_sdm)
## [1] 241.8010 241.8027 241.8044 241.8061 241.8077 241.8094
# Plot in sdm
plot(fit_sdm, type = "extrapolation", m = 10000)
Saturation and asymptotic richness methods
The final feature of sdm
is the fact that it provides an estimate for the asymptotic species richness, which is the total number of distinct species present in the population. The estimator is equal to \[\mathbb{E}(K_\infty \mid K_n = k) = k + \sum_{j = n}^\infty S(j; \theta) <\infty.\] Moreover, \(K_\inf\mid K_n = k\) is a random variable. To simulate from it, run
### Simulate from the posterior distribution of the species richness
set.seed(1)
<- sample_Kinf(fit_sdm, size = 1000)
Kinf plot(density(Kinf), main = "Posterior species richness")
By sampling from the posterior species richness we can also assess the saturation level, which is the fraction of species discovered over the asymptotic estimate, namely \[S_n\mid K_n = k = \frac{k}{K_\infty}.\] As computing the expected value for this quantity can be demanding,
sdm
gives the possibility to compute \(k/\mathbb{E}(K_\infty \mid K_n = k)\), which is a first approximation. This can be achieved my specifying method = "approximate"
(the default). Otherwise, if one wants to carry uncertainty quantification on the saturation level, set method = "montecarlo"
and specify the number of samples n_samples
to draw. This last option can be slow. See the following code:
# Asymptotic richness estimators
asym_richness(fit_sdm)
## Exp. species at Inf sd at Inf
## 257.00000 16.03976
# Approximate saturation
saturation(fit_sdm, method = "approximate")
## [1] 0.9338521
# Samples from the posterior saturation
set.seed(1)
<- saturation(fit_sdm, method = "montecarlo", n_samples = 500)
post_sat plot(density(post_sat), main = "Posterior saturation")
Finally, we can compute the approximate number of additional samples to draw to reach a desired level of saturation. Can be obtained by specifying method = "target"
and a desired target
. Note that target
has to be between 0 and 0.999 for stability reasons.
# Additional samples to reach a desired level of saturation
saturation(fit_sdm, method = "target", target = 0.9)
## The approximate saturation is already 0.9339
## Please specify a higher target value
saturation(fit_sdm, method = "target", target = 0.99)
## [1] 18912
Acknowledgements
The build up of this package and the research have been funded by the LIFEPLAN project (European Union’s Horizon 2020 research and innovation program - grant agreement No 856506).
References
Duke University, alessandro.zito@duke.edu↩︎
University of Milano-Bicocca, tommaso.rigon@unimib.it↩︎