A tutorial of BayesANT

Bayesian nonparametric taxonomic classifier for DNA barcoding sequences

Introduction

Installation

# If the devtools R package is not already installed
# install.packages("devtools")
devtools::install_github("alessandrozito/BayesANT")
library(BayesANT)

Package description

BayesANT is an R package that implements the Bayesian nonparametric taxonomic classifier described in Zito, Rigon, and Dunson (2022). The code is available at this GitHub repository. In this tutorial we provide an overview of the functions in the package and briefly describe the algorithm. For explicit mathematical details and estimation methods, see the Supplementary material in Zito, Rigon, and Dunson (2022).

The package is structured around two core methods:

  • The method BayesANT, which trains the taxonomic classifier on a set of annotated DNA sequences. A companion summary.BayesANT method summarises the main details of the fitted model.
  • The method predict.BayesANT, which can be used to annotate test DNA sequences.

Additional functions included in the package are

  • read.BayesANT.data, which loads a set of DNA sequences saved in a .fasta file in the format accepted by the BayesANT function.
  • read.BayesANT.testDNA, useful to load test DNA sequences for prediction.
  • add_novelty_to_test_data, which relabels the true annotations in the test set to account for novelty.
  • plot_accuracies, which displays the calibration plot for the predicted annotations if the true test labels are available.

In this tutorial, we show how to use BayesANT on a simulated reference library and a simulated set of test DNA sequences. Both files are available in .fasta format via the following code.

file.train <- system.file("extdata/trainDNA.fasta", package = "BayesANT")
file.test <- system.file("extdata/testDNA.fasta", package = "BayesANT")

The algorithm

BayesANT is short for BAYESiAn Nonparametric Taxonomic classifier. It is an off-the-shelf algorithm that annotates test DNA sequences probabilistically, and it needs to be trained on a set of DNA sequences annotated up to a given taxonomic rank. When classifying, BayesANT accounts both for the taxa observed in training and for potential novel ones at every level of the taxonomy. It does so by adopting Pitman–Yor process priors (Perman, Pitman, and Yor 1992) over each node in the taxonomic tree. DNA sequences, instead, are modelled by means of a multinomial kernel with Dirichlet priors. We discuss the details below. Note: skip this section if you are solely interested in learning how to use the package.

The Pitman–Yor process

The Pitman–Yor process (Perman, Pitman, and Yor 1992) is a sequential allocation mechanism for species labels. In its urn scheme formulation, let \(V_1,\ldots, V_n\) be a sequence of taxon assignments for the DNA sequences in the training library, comprising a total of \(K_n = k\) distinct labels denoted as \(V_1^*, \ldots, V_k^*\) and appearing with frequencies \(n_1, \ldots, n_k\). Then, the taxon of the \((n+1)\)st observation is determined as

\[ (V_{n+1} \mid V_1, \ldots, V_n) = \begin{cases} V_j^*, & \text{with prob.} \quad (n_j - \sigma)/(\alpha + n), \ j = 1, \ldots, k\\ \text{''new''}, & \text{with prob.} \quad (\alpha + \sigma k)/(\alpha + n), \end{cases} \] where \(\sigma \in [0,1)\) is a discount parameter governing the tail of the process and \(\alpha > -\sigma\) is a precision parameter. High values for \(\alpha\) and \(\sigma\) lead to a high number of distinct labels \(K_n\). Moreover, high values for \(n_j\) lead to a high probability that taxon \(V_j^*\) will be observed in the future. The figure below, borrowed from Zito, Rigon, and Dunson (2022), shows a practical illustration when \(\alpha = 1\) and \(\sigma= 0.25\).

Example of a Pitman–Yor process with \(n= 19\), \(\alpha = 1\), \(\sigma = 0.25\) and \(K_n=4\). Taxon names are reported on top of the circles, and frequencies of appearance are written to the right of the blue DNA sequences. Fractions in black denote the taxon probabilities for the orange DNA sequence. For example, the probability of observing the butterfly-shaped taxon \(V_1^*\) is \((n_1 - \sigma)/(\alpha+n) = (10 - 0.25)/(19 + 1) = 39/80\). The probability for the unknown question mark taxon is \((\alpha + \sigma k)/(\alpha+n) = (1 + 4\times0.25)/(19 + 1) = 1/10\).

Here, we aim to assign a label to the 20th DNA sequence, conditional on having already observed 4 distinct taxa. Parameters \(\alpha\) and \(\sigma\) are estimated from the data via maximization of the exchangeable partition probability function (De Blasi et al. 2015). See the supplement in Zito, Rigon, and Dunson (2022) for explicit details.
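
As a concrete check of the urn scheme above, the probabilities in the figure can be reproduced with a few lines of R. The sketch below is not part of the package, and the taxon counts other than \(n_1 = 10\) are hypothetical, chosen only to sum to \(n = 19\).

## Pitman-Yor predictive probabilities (illustrative sketch, not package code)
py_predictive <- function(counts, alpha, sigma) {
  n <- sum(counts)
  k <- length(counts)
  c(observed = (counts - sigma) / (alpha + n),
    new = (alpha + sigma * k) / (alpha + n))
}
## n_1 = 10 as in the figure; the remaining counts are made up so that n = 19
py_predictive(counts = c(10, 5, 3, 1), alpha = 1, sigma = 0.25)

The first entry equals \((10 - 0.25)/(1 + 19) = 39/80\) and the last one equals \((1 + 0.25 \times 4)/(1 + 19) = 1/10\), matching the figure.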

Prior probabilities

BayesANT extends the Pitman–Yor process described above to every level in the taxonomic tree. In particular, consider a taxonomic library \(\mathcal{D}_n = (\mathbf{V}_i,\mathbf{X}_i )_{i=1}^n\) of size \(n\) and of \(L\geq 2\) levels, where \(\mathbf{V}_i = (V_{i, \ell})_{\ell =1}^L\) are the taxonomic annotations for DNA sequence \(\mathbf{X}_i\). Following the notation in Zito, Rigon, and Dunson (2022), let \(V_{j, \ell}^*\) be the \(j\)th taxon and \(\mathbf{V}_{\cdot, \ell}^{(n)} = (V_{i, \ell})_{i =1}^n\) be the sequence of taxa observed for level \(\ell\).

The first step is to construct the taxonomic tree. For a generic taxon \(v_\ell\in \mathcal{V}_\ell\) at level \(\ell\), where \(\mathcal{V}_\ell\) is the space of taxa at \(\ell\), define \(\textrm{pa}(v_\ell)\) as the unique parent node of \(v_\ell\) at level \(\ell -1\) and \(\rho_n(v_\ell)\) as the set of nodes \(v_{\ell+1}\) at level \(\ell+1\) such that \(\textrm{pa}(v_{\ell+1}) = v_\ell\). We also let \(K_n(v_\ell) = |\rho_n(v_\ell)|\) be the number of nodes linked to \(v_\ell\) at level \(\ell + 1\) and \(N_n(v_\ell)\) be the size of the taxon, namely the number of DNA sequences linked to \(v_\ell\). The prior prediction scheme becomes \[ (V_{n+1, \ell} \mid V_{n+1, \ell-1} = v_{\ell-1}, \mathbf{V}_{\cdot, \ell}^{(n)}) = \begin{cases} V_{j, \ell}^*, &\text{with prob.} \quad \big(N_n(V_{j, \ell}^*) - \sigma_\ell \big)/\big(\alpha_\ell + N_n(v_{\ell-1})\big), \\ \text{''new''}, & \text{with prob.} \quad \big(\alpha_\ell + \sigma_\ell K_n(v_{\ell-1})\big)/\big(\alpha_\ell + N_n(v_{\ell-1})\big), \end{cases} \] for all \(j\) such that \(\textrm{pa}(V_{j, \ell}^*) = v_{\ell -1}\), where \(\sigma_\ell \in [0,1)\) and \(\alpha_\ell > -\sigma_\ell\) are level-specific parameters; sharing them across all branches at the same rank allows for borrowing of information across branches. The figure below shows an example of a taxonomic tree divided into 3 levels (Phylum, Class and Order) and for which there are a total of \(n=28\) DNA sequences divided into 2 taxa at the first level, 5 at the second and 8 at the third. Under the assumption that a new node at level \(\ell\) automatically creates a new node at all levels \(\ell + 1, \ldots, L\) below, the total number of new leaves in this taxonomy is 8: one for every node at level 1 (2 leaves), one for every node at level 2 (5 leaves) and one entirely new branch (the one with the ant-shaped symbol).

Example of a three-level taxonomic library. The total sample size of this example is \(n =28\). Circles in blue indicate nodes linked to leaves with observed DNA sequences, while circles in gray show all the possible missing or undiscovered branches.

The final prior probability for the \((n+1)\)th DNA sequence is then the product of Pitman–Yor probabilities across levels, namely \[ \begin{split} \pi_{n+1}(v_L) &= \text{pr}(V_{n+1, L} = v_L \mid \mathcal{D}_n) \\ &= \text{pr}(V_{n+1, 1} = v_1 \mid \mathbf{V}_{\cdot, 1}^{(n)})\times \prod_{\ell=2}^{L} \text{pr}(V_{n+1, \ell}=v_{\ell}\mid V_{n+1, \ell-1}=v_{\ell-1}, \mathbf{V}_{\cdot, \ell}^{(n)}). \end{split} \]
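
To see how these probabilities chain together, here is a minimal sketch (not package code) that multiplies level-specific Pitman–Yor terms along one path of the tree; all counts and parameter values below are made up for illustration.

## Prior probability of one leaf as a product of Pitman-Yor terms (illustrative)
py_term <- function(n_taxon, n_parent, alpha, sigma) {
  (n_taxon - sigma) / (alpha + n_parent)
}
## Hypothetical path with N_n(v_1) = 20 out of n = 28, N_n(v_2) = 12, N_n(v_3) = 7
pi_leaf <- py_term(20, 28, alpha = 1, sigma = 0.25) *
  py_term(12, 20, alpha = 1, sigma = 0.25) *
  py_term(7, 12, alpha = 1, sigma = 0.25)
pi_leaf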

Posterior probabilities

Once prior probabilities \(\pi_{n+1}(v_L)\) are determined, BayesANT specifies the posterior prediction probabilities by means of a multinomial kernel. In particular, we let \[ (\mathbf{X}_i \mid V_{i, L} = v_L, \boldsymbol{\theta}_{v_L}) \stackrel{\text{iid}}{\sim}\mathcal{K}(\mathbf{X}_i; \boldsymbol{\theta}_{v_L}), \] where \(\mathcal{K}(\cdot\,; \boldsymbol{\theta}_{v_L})\) is a distribution that depends on the leaf-specific vector of parameters \(\boldsymbol{\theta}_{v_L}\). Depending on the DNA sequences in training, BayesANT can adopt one of two kernels:

  • If the sequences are aligned and each is of length \(p\), we have a product-multinomial kernel, i.e.  \[ \mathcal{K}(\mathbf{X}_i; \boldsymbol{\theta}_{v}) = \prod_{s=1}^p\prod_{g\in \mathcal{N}_1} \theta_{v, s,g}^{\mathbf{1}\{X_{i,s} = g\}}, \] where \(\theta_{v,s,g}\) is the probability that nucleotide \(g \in \mathcal{N}_1 = \{A,C,G,T\}\) is observed at location \(s = 1, \ldots, p\) for taxon \(v\). If we further assume a conjugate Dirichlet prior over each parameter, namely \[\boldsymbol{\theta}_{v,s} \sim Dirichlet(\xi_{v,s,\text{A}}, \xi_{v,s,\text{C}}, \xi_{v,s,\text{G}}, \xi_{v,s,\text{T}})\] with \(\boldsymbol{\xi}_{v,s} = (\xi_{v,s,\text{A}}, \xi_{v,s,\text{C}}, \xi_{v,s,\text{G}}, \xi_{v,s,\text{T}})^\top\) being a vector of hyperparameters, the posterior distribution becomes \[ p(\boldsymbol{\theta}_v \mid \mathcal{D}_n) \propto \prod_{s=1}^p \prod_{g\in \mathcal{N}_1} \theta_{v, s,g}^{\xi_{v,s,g} + n_{v,s,g} - 1}, \] where \(n_{v,s,g} = \sum_{i: V_{i, L} = v} \mathbf{1}\{X_{i,s} = g\}\) indicates the number of times nucleotide \(g \in \mathcal{N}_1\) is recorded at locus \(s\) for the DNA sequences linked to leaf \(v\).

  • If the sequences are not aligned, we specify a multinomial kernel over the \(\kappa\)-mer decomposition of the sequence, i.e.  \[ \mathcal{K}(\mathbf{X}_i; \boldsymbol{\theta}_{v}) = \prod_{g=1}^{4^\kappa} \theta_{v, g}^{n_{i,g}}, \] where \(n_{i,g}\) is the total number of times \(\kappa\)-mer \(g\) appears in the sequence \(\mathbf{X}_i\) (a small counting sketch is given after this list). Under a Dirichlet prior similar to the one above, the posterior distribution is again a Dirichlet.
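
For intuition on how the \(\kappa\)-mer counts \(n_{i,g}\) can be obtained, the following standalone sketch decomposes a short string into overlapping \(\kappa\)-mers. It is purely illustrative and is not the routine used internally by the package.

## Count overlapping kappa-mers in a DNA string (illustrative only)
count_kmers <- function(dna, kappa = 4) {
  starts <- seq_len(nchar(dna) - kappa + 1)
  table(substring(dna, starts, starts + kappa - 1))
}
head(count_kmers("GACTTGTCGGGTGGCGCCTAGTCGCGG", kappa = 4))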

Irrespective of the kernel chosen, the prediction probabilities of BayesANT are computed as

\[ p_{n+1}(v_L) \propto \pi_{n+1}(v_L) \int \mathcal{K}(\mathbf{X}_{n+1}; \boldsymbol{\theta}_{v_L}) p(\boldsymbol{\theta}_{v_L}\mid \mathcal{D}_n)\mathrm{d}\boldsymbol{\theta}_{v_L}, \] where \(p_{n+1}(v_L) = \text{pr}(V_{n+1, L} = v_L \mid \mathbf{X}_{n+1}, \mathcal{D}_n)\) and \(v_L\) is a generic leaf in the taxonomic tree. If \(v_L\) is an unobserved leaf, we replace the posterior \(p(\boldsymbol{\theta}_{v_L}\mid \mathcal{D}_n)\) with the prior \(p(\boldsymbol{\theta}_{v_L})\), obtained by setting the nucleotide counts to 0. In the aligned case, for example, the prediction rule is

\[ p_{n+1}(v_L) \propto \begin{cases} \pi_{n+1}(v_L) \prod_{s=1}^p (\xi_{v_L,s, g_s} + n_{v_L,s,g_{s}})/M_{v_L,s}, & \textrm{if } v_L \textrm{ is an observed leaf} \\ \pi_{n+1}(v_L) \prod_{s=1}^p \xi_{v_L,s, g_s}/\xi_{v_L,s,0}, & \textrm{if } v_L \textrm{ is a novel leaf} \end{cases} \] where \(g_s = X_{n+1, s}\) is the nucleotide of the query sequence at locus \(s\), \(M_{v,s} =\sum_{g \in \mathcal{N}_1} (\xi_{v,s,g} + n_{v,s,g})\) is a normalizing constant for the nucleotide probabilities and \(\xi_{v,s,0} =\sum_{g \in \mathcal{N}_1} \xi_{v,s,g}\) is the sum of the hyperparameters at location \(s\) for leaf \(v\). The hyperparameters are tuned automatically via the method of moments, as detailed in the supplementary material in Zito, Rigon, and Dunson (2022).
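
To make the aligned prediction rule concrete, the sketch below evaluates both cases for a single query over \(p = 3\) loci. All counts and hyperparameters are hypothetical, and the code is not part of the package.

## Nucleotide counts n_{v,s,g} for one observed leaf: rows A, C, G, T; columns are loci
counts <- matrix(c(8, 0, 1, 0,
                   0, 9, 0, 0,
                   1, 0, 0, 8),
                 nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))
## Hyperparameters xi_{v,s,g}, here constant for simplicity
xi <- matrix(0.1, nrow = 4, ncol = 3, dimnames = list(c("A", "C", "G", "T"), NULL))
## Nucleotides of the query sequence at the 3 loci
query <- c("A", "C", "T")
rows <- match(query, rownames(counts))

## Observed leaf: prod_s (xi + n)/M
post <- xi + counts
lik_observed <- prod(post[cbind(rows, 1:3)] / colSums(post))
## Novel leaf: prod_s xi/xi_0
lik_novel <- prod(xi[cbind(rows, 1:3)] / colSums(xi))
c(observed = lik_observed, novel = lik_novel)

These marginal likelihood terms are then multiplied by the corresponding prior probabilities \(\pi_{n+1}(v_L)\) and normalised across leaves.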

Prediction rule

Once the posterior probabilities are defined for all leaves, BayesANT annotates the DNA sequence \(\mathbf{X}_{n+1}\) with the labels \((v_\ell^*)_{\ell=1}^L\) that satisfy \[ v^*_\ell = {\arg\max}_{v_\ell \in \rho_n(v^*_{\ell-1})} \sum_{v_L \in \mathcal{L}_n(v_\ell)} p_{n+1}(v_L), \] where \(\mathcal{L}_n(v_\ell)\) is the set of all leaves linked to node \(v_\ell\) in a library of \(n\) DNA sequences. As we detail below, the package also allows the option to return not only the most likely taxon, but also the leaves with the highest probabilities.
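
The following toy example illustrates the top-down argmax rule; the tree, taxon names and probabilities are hypothetical, and the code is not the package implementation.

## Toy leaf probabilities over a three-level tree
leaves <- data.frame(
  Phylum = c("P1", "P1", "P1", "P2"),
  Class  = c("C1", "C1", "C2", "C3"),
  Family = c("F1", "F2", "F3", "F4"),
  prob   = c(0.25, 0.20, 0.20, 0.35)
)
prediction <- character(0)
current <- leaves
for (level in c("Phylum", "Class", "Family")) {
  ## sum the leaf probabilities within each candidate child and keep the largest
  taxon_probs <- tapply(current$prob, current[[level]], sum)
  best <- names(which.max(taxon_probs))
  prediction <- c(prediction, best)
  current <- current[current[[level]] == best, ]
}
prediction

Note that Phylum P1 is selected because its leaves jointly carry probability 0.65, even though the single most probable leaf (F4, with probability 0.35) sits under P2.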

When specifying the prediction probabilities, BayesANT introduces an additional parameter, \(\rho \in (0,1]\), which can be tuned to recalibrate the output. In particular, we post-process the prediction probabilities \(p_{n+1}(v_L)\) as \[ \tilde{p}_{n+1}(v_L) = \frac{p_{n+1}(v_L)^\rho}{\sum_{v \in \mathcal{V}_L} p_{n+1}(v)^\rho}, \] which are later used in our prediction algorithm. Such a strategy does not alter the ranking of the original probabilities, since the transformation is monotonic. Moreover, if \(p_{n+1}(v_L)=1\), then also \(\tilde{p}_{n+1}(v_L)=1\), implying that the overall accuracy is barely modified. The value of \(\rho\) can be chosen via cross-validation. In particular, prediction probabilities are calibrated if the average probability of the predicted nodes equals the classification accuracy. For example, if 90% of the sequences are correctly classified, ideally the average classification probability is approximately 0.9. Average values of 0.5 or 0.99, instead, indicate that the algorithm is respectively too conservative or too confident. The default value of \(\rho = 0.1\) works well in practice, but it should nonetheless be chosen with care.
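
As a quick numerical illustration of the recalibration (not package code), with hypothetical probabilities:

## Temperature recalibration of prediction probabilities (illustrative)
recalibrate <- function(p, rho = 0.1) p^rho / sum(p^rho)
p <- c(0.90, 0.06, 0.03, 0.01)
round(recalibrate(p, rho = 0.1), 3)

The ranking is preserved, but with \(\rho < 1\) the probabilities are pulled towards each other, which mitigates over-confident raw probabilities.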

The BayesANT module

The taxonomic classifier described above is implemented in the package. In what follows, we detail how to train it with the BayesANT function.

Step 1: load DNA sequences

As a first step, we need to load the DNA sequences into the environment. The package already ships with a set of training DNA sequences, stored in a .fasta file. We have chosen this format since it is the most widely used when working with DNA data. To see the file, run the following commands.

# Find the link to the training file 
file.train <- system.file("extdata/trainDNA.fasta", package = "BayesANT")
# See how the data are stored
readLines(file.train, n = 8)
## [1] ">ID_2 Root;V*-1,1;V*-1,2;V*-1,3;V*-1,4"                      
## [2] "GACTTGTCGGGTGGCGCCTAGTCGCGGCCAAAAACACACAGGTCATACTGCTGATTTCAG"
## [3] "CCAAAATTACTCGCGCCTCTGCGAGCTCTTGGCAGATAAGGTCACGCCTTGTACAACCCT"
## [4] "TAGCTGTGACAGTATCTCAATGATTAAGACCGAGCGTAGCCGGTCCCGACCTCATGCAGC"
## [5] "AGAGGTTCACGGTAAATGTTCGTTATGGTTCCCGAGGGCTCCACTGTTAAGCGTAAGCAC"
## [6] "ACGGCTCGCTGCCCTCACAAACCAGGGTGAGCGGGCGGCGTATACATACTACACCGGTCT"
## [7] "CAGGCTTTAAACGGTGGATCATGTATTAGACCCCGCACGGAGCGGGTGTCTACCACAATC"
## [8] "TATAAATACCGTACCTATCCCCAGCTCTGCATGGTTCCCT"

The output above corresponds to the first DNA sequence in the file. The name of the sequence is ID_2, and its annotation, Root;V*-1,1;V*-1,2;V*-1,3;V*-1,4, comprises 4 taxonomic ranks. Here, Root is the root of the tree, and V*-1,1 is the node at the first taxonomic level. Names are chosen to mirror the notation above, since this is a simulated dataset. Notice that each annotation must start with Root, and names must be separated with a semicolon ;. Note: to work with BayesANT, all training DNA sequences must follow this nomenclature. Failing to do so results in incorrect use of the package.
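
If your own reference library is stored in another format, one simple way to produce a .fasta file with the required header structure is sketched below. The file name and taxon names are hypothetical.

## Write two hypothetical records in the format expected by BayesANT:
## ">ID", a space, then the semicolon-separated annotation starting at Root
records <- c(
  ">seq_1 Root;Arthropoda;Insecta;Lepidoptera;Erebidae",
  "ACGTACGTACGTACGTACGT",
  ">seq_2 Root;Arthropoda;Insecta;Diptera;Culicidae",
  "TTGCATGCATGCATGCATGC"
)
writeLines(records, "my_library.fasta")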

To facilitate the correct use of these data within the package, we have created the function read.BayesANT.data, which returns an object of class c('data.frame', 'BayesANT.data').

# Load training data 
rank <- 4
rank_names <- c("Phylum","Class", "Order", "Family")
data <- read.BayesANT.data(fasta.file = file.train, 
                           rank = rank, 
                           rank_names = rank_names)
# See some random observations
set.seed(10)
data[sample(nrow(data), size = 5), ]
##         Phylum   Class    Order   Family
## ID_662  V*-1,1  V*-2,2  V*-26,3 V*-128,4
## ID_4951 V*-3,1 V*-22,2 V*-237,3 V*-854,4
## ID_4518 V*-3,1 V*-21,2 V*-228,3 V*-838,4
## ID_5944 V*-3,1 V*-22,2 V*-237,3 V*-863,4
## ID_7443 V*-3,1 V*-23,2 V*-254,3 V*-981,4
##                                                                                                                                                                                                                                                                                                                                                                                                                      DNA
## ID_662  GACTTGTCGGGTGGCGCCTAGTCGCTGTCGAAATAACACAGGCCATACTCCGCTTGACGGGCACAATTTCAATCGACCCATGCAGCCTTGGACAGAGAAGGTCACGACCCGGACTACCGTCAGCAACGACGGAAACTCAGTGATGAAGGCCGATCGTAGCGGGTCCCGACTCCATGCTGCATAGCATCAAGGTAAAGGCTCATTATGGTTCCAGATGACTCCACTGTCAAGCGTTAGCACACTCCTCGCTGACGTCAATAAACAGGGGTAAAGGGCCGAGTATACATACTACACTTCTCTCTGGCTTTAAACTGTGGATCATGTCTTAGACCCCGCCCATAGCTCATGATTTCAAGAAATTATCATTACCGCAAGTCTCTACTGCTCAGCATGGTTCCCT
## ID_4951 GACTTGTCGGGTGGCGCCTAGTCGCGGTGAATATACCAGTTCCTCCGTCTCGCTGCCCAGCCATAATTTGAGAGTCCTTAGGCGGGCCGGGTGAGGCAAGGTAACGGCGCTGACAATTTGCAGCGACGGCACTATATTCGTGACTAAAACCTATAATTGATATCCACTGCTGTATCCTGCAGAGAATCGAGTGAACTATCAGGTGTGGCTCCCGAGGGGGCCGGCGACGAGGGTTAGCACAGCCGCAGCGTTAATTGAAAACCAGGATGAAAAGTACGCGGAGGCATTCAACCCCTGACTCTGCTTTTATGAGGTCGCTCGAGTGACCGAGCCCGGTGCCAGCTGCCAGCTTCCAGAAATTGTGATTACCGGCATTCGTTGCAGCTCATCATGCTTGACT
## ID_4518 GACTTGTCGGGTGGCGCCTAGTCGCGGCCGGAATCTAACTGCCCCTACTTCGCTTGCCTGCCAAACGTTCAATGGCCTCAGTCAGACCGTGGCAAAAAACGTCACGCCCGCGACAACCATCACTTGTGTGGCTCCCTCAGTGCTTAACACCTATCATTGCCGGTCCCTACTACGTATTACAGAGAATCACGTTACATGTTAAGTGTGGTACCCAAGGCCTCCGCTGTCGATCGTTAGAACCGCAATAGCTGTCATGCAAAACCTTACTGAAATCGCTGCGTATTCATACAACCCCAGACTCTGGTTTTGACTGGCCAACCGTGTGATAGAACCAGGCCTCAGCGGGTGACTTCTAGTAATTATCATTACCGTCACTAATTACAGCTCAGCATGGTTCCCT
## ID_5944 GACTTGTCGGGTGGCGCCTAGTCGCGGTCAAGATTCCACTTCCTCCGTGGAGCTCCCCAGCCATAATATCATTGTCCTAAGGCCGGGCGGGTGAGGCAACGTCACGGCGCTGACAACTCGCAGCGACGGCGCTATGTTAGTGACTAAAACGTATAATTGCTATCCCCTGCTGTATCCTTCAGAGAATCGCGAGAAATACCAGGTGTGGCTCCCGAGGGCGCCGGCCACGAGGGTTAGCACAGCCGTAGCGCTAATTGAAAACCAGGGTGAAATGTACGCGTATGCATTCAACCCCTAACTCTGCTTTTATGCGGTCGTTAGAGTGACCGAGCCCGATGCCAGCGGGTAACTTCGAGAATATGTGATTAACGGCATTCGTTGCCGCTCATCATGCTTCGCT
## ID_7443 GACTTGTCGGGTGGCGCCTAGTCGCGACCGAAGTCACACGGCGAATAGTGCGCTGGCCTGCCGAAGTTACAATGGCCAAAGGCAACCCGTGAGAGAGAAAGTCAGGCACCTGACCCTCGTCAGTGGTTACGGTATCGCTGTGATTAAGCCCTATCATTGCCGGTCCCGACTCTGGTTTACAGAGAATCCCGTTTCATGCAATTTATGGTTCACGAGGTCTTCGCTGTCGAGCGTTAGCACAGACGTCGCTGTCCTCCAAAACCCGTGAGCTATGGCCGCGAATACATAGAACCCCGGTCCCGCGATCCAACCGGTGGATCATATGACAGACCCCATTCACAGCGGGTGACTTCGAGAATTTAAGACTACCGTGCCTATTGACAACTCGGCCTGGGTCCCC

When loading the data, one can specify the rank at which to load the sequences (default is the lowest observed) and the rank_names (defaults are Level1 up to Level4 in this case). The last column, data$DNA, contains the DNA sequences in upper case and in string format. In this example, we have a set of \(7500\) aligned sequences of length \(p=400\), annotated across \(4\) taxonomic levels.

# Access the length of the DNA sequences stored in the last column
length_DNA <- stringr::str_length(data$DNA)
quantile(length_DNA)
##   0%  25%  50%  75% 100% 
##  400  400  400  400  400

One useful tool to evaluate the sequences in the library is the location-specific Gini index, a measure of heterogeneity. We compute and plot it with the following code:

## Function to compute the Gini index
Gini <- function(freq){
  freq_rel <- freq / sum(freq)
  1 - sum(freq_rel^2)
}

## Apply the function to all the sequences in the training data
Gini_location <- apply(
  stringr::str_split(string = data$DNA, pattern = "", simplify = TRUE),
  2, function(x) Gini(table(x))
)
## Make the plot
plot(Gini_location)

Here, we see that the first 25 loci of the aligned DNA sequences form a conserved region, since they are identical across the library. The remaining loci, instead, are quite heterogeneous.

Step 2: train the classifier

Once the training library is correctly loaded, we are ready to train the classifier. We do this as follows:

## Train the classifier under the aligned kernel
classifier <- BayesANT(data = data,
                       typeseq = "aligned", 
                       type_location = "single", 
                       newtaxa = TRUE,
                       verbose = TRUE)
## Welcome to BayesANT 🐜 🐜 🐜 🐜  
## No missing annotations detected. 
## Estimating the Pitman-Yor parameters: Now. 
## Estimating the Pitman-Yor parameters: Done. 
## Computing prior probabilities: Now 
## Computing prior probabilities: Done. 
## Extracting nucleotide location counts from sequences: Now. 
## Extracting nucleotide location counts from sequences: Done. 
## Tuning hyperparameters: Now. 
## Tuning hyperparameters: Done. 
## Aggregating parameters: Now 
## Aggregating parameters: Done

The function BayesANT has the following arguments:

  • data: an object loaded with the function read.BayesANT.data
  • typeseq: type of sequences loaded in the classifier. Options are aligned and not aligned. This parameter determines which type of multinomial kernel to use. Selecting typeseq = 'aligned' requires all sequences to be of the same length.
  • type_location: what type of aligned kernel to use. Options are single for the standard multinomial kernel over nucleotides ACGT, and pairs for a multinomial kernel over the nucleotide pairs AA, AC, AG…
  • kmers: \(\kappa\)-mer length for the not aligned kernel. Recommended only if the sequences are truly not aligned.
  • newtaxa: logical. Whether to include new taxa in the training of the algorithm. If FALSE, it ignores the Pitman–Yor prior and classifies only by looking at the taxa observed in training.
  • verbose: if TRUE, monitor the progress of the training step.

In this tutorial, we use the aligned kernel, since our training sequences all have the same length. The not-aligned version follows easily by setting typeseq = 'not aligned' and kmers = 6 (or any desired value).
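
As a sketch (not run in this tutorial), the corresponding call would look as follows, using the arguments listed above.

## Not run: train on non-aligned sequences with a 6-mer multinomial kernel
# classifier_kmer <- BayesANT(data = data,
#                             typeseq = "not aligned",
#                             kmers = 6,
#                             newtaxa = TRUE,
#                             verbose = TRUE)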

To summarise the output of a classifier, run:

# Summary of the classifier
summary(classifier)
## BayesANT classifier 🐜🐜🐜🐜
## 
## Number of DNA sequences in the library: 7500
## Type of sequences: aligned
## Average length: 400
## Number of nodes in the taxonomy: 2813
## Number of new nodes added to the taxonomy: 691
## 
## Taxonomic tree and estimated parameters:
##            Observed taxa   New Taxa   alpha   sigma      loglik
##  -------  --------------  ---------  ------  ------  ----------
##  Phylum               17          1    2.00    0.00   -11252.82
##  Class                66         18    0.95    0.00    -7947.93
##  Order               504         84    2.88    0.18   -11814.21
##  Family             1535        588    7.91    0.00   -11181.47

Here, we report the details of the taxonomic tree and the estimated level-specific prior parameters \(\alpha_\ell\) and \(\sigma_\ell\).

Prediction

Once the classifier has been trained, we are ready to annotate the test query sequences. The package also ships with test sequences, generated from a process similar to that of the training ones. We load them with the function read.BayesANT.testDNA as follows:

# Load the test sequences
file.test <- system.file("extdata/testDNA.fasta", package = "BayesANT")
testDNA <- read.BayesANT.testDNA(file.test)
testDNA[1]
##                                                                                                                                                                                                                                                                                                                                                                                                               ID_1 
## "GACTTGTCGGGTGGCGCCTAGTCGCGGCCAAAAACACACAGGTCATACTGCTGCTTTCAGCCAAAATTACTCTCGCCTCTGCGAGCTCTTGGCAGATAAGGTCACGCCTTGTACAACGCTTAGCTGTGACAGTTTCTCAATGATTAAGACCGAGCGTAGCCGGTCCCGACCTCATGCAGCAGAGGTTCACGGTCAATGTTCGTTATGGTTCCCGAGGGCTCCACTGTTAAGCGTAAGCACACTCCTCGCTGCCCTCACAAACCAGGGTGAGCGGGCGGCGTATACATACTACACCGGTCTCAGGCTTTAAACGGTGGATCATGTATTTGACCACGCCCGTAGCGGGTGTCTACCACAATCTATAAATACCGTACCTATCCCCAGCTCTGCATGGTTCCCT"

This function loads the DNA sequences saved in .fasta file as a named vector of strings. Then, to annotate the sequence we use the standard predict method:

# Annotate a test sequence
annot <- predict(classifier, 
                 DNA = testDNA[1], 
                 rho = 0.1, 
                 return_probs = TRUE, 
                 n_top_taxa = 5, 
                 cores = 1)

The function predict.BayesANT has the following arguments:

  • object: an object of class BayesANT, i.e. the output of a trained BayesANT classifier
  • DNA: a named vector of DNA strings, possibly loaded with the function read.BayesANT.testDNA
  • rho: temperature parameter to calibrate the probabilities. Default is rho=0.1
  • return_probs: whether to return the highest leaf probabilities
  • n_top_taxa: number of leaves to return. Valid if return_probs=TRUE
  • cores: Number of cores to run the prediction in parallel, implemented via the packages foreach and doParallel. Default is cores=1 which is equivalent to a for loop.

When we specify return_probs=TRUE, the package returns the following:

## Predicted annotation for testDNA[1]
annot$prediction
##      Phylum  Class  Order Family Prob_Phylum Prob_Class Prob_Order Prob_Family
## ID_1 V*-1,1 V*-1,2 V*-1,3 V*-1,4           1  0.9999999  0.9999982   0.9816346
## First 5 leaves with the highest probability
annot$top_n_probs
## $ID_1
##     Phylum  Class  Order            Family   leaf_prob
## 2   V*-1,1 V*-1,2 V*-1,3            V*-1,4 0.981634599
## 287 V*-1,1 V*-1,2 V*-1,3 V*-1,3_Family_new 0.005526119
## 6   V*-1,1 V*-1,2 V*-1,3            V*-5,4 0.005358812
## 7   V*-1,1 V*-1,2 V*-1,3            V*-6,4 0.005358812
## 4   V*-1,1 V*-1,2 V*-1,3            V*-3,4 0.001051397

Notice that using an aligned kernel requires the training and test sequences to be aligned with the same algorithm and to have the same length.

Before predicting the whole test set, let's look at one additional example. To verify whether BayesANT reasonably recognizes novelty, we create four mock DNA sequences whose generating process is totally different from the one observed in the data. If BayesANT works correctly, they should all be assigned to a new branch at the first level.

## Extract the conserved region
set.seed(42)
region <- substr(data$DNA[1],1, 25)
mockDNA <- character(4)
## Generate a mock query at random
mockDNA[1] <- paste(sample(size = 400, x = c("A", "C", "G", "T"), replace = T), 
                    collapse="")
## Generate a mock query by repeating ACGT continuously
mockDNA[2] <- paste(rep("ACGT", 100), collapse = "")
## Generate a mock query by copying the conserved region and padding with the nucleotide C
mockDNA[3] <- paste0(region, paste(rep("C", 375), collapse = ""))
## Generate a mock query by copying the conserved region and appending random nucleotides
mockDNA[4] <- paste0(region, paste(sample(size = 375, x = c("A", "C", "G", "T"), 
                                          replace = T), collapse=""))
## Assign names
names(mockDNA) <- c("random", "ACGT", "region_C", "region_Random")
## Annotate the mock queries
predict(classifier, mockDNA)
##                   Phylum      Class      Order     Family Prob_Phylum
## random        Phylum_new Phylum_new Phylum_new Phylum_new   0.9993315
## ACGT          Phylum_new Phylum_new Phylum_new Phylum_new   0.9999718
## region_C      Phylum_new Phylum_new Phylum_new Phylum_new   0.9999228
## region_Random Phylum_new Phylum_new Phylum_new Phylum_new   0.9981647
##               Prob_Class Prob_Order Prob_Family
## random         0.9993315  0.9993315   0.9993315
## ACGT           0.9999718  0.9999718   0.9999718
## region_C       0.9999228  0.9999228   0.9999228
## region_Random  0.9981647  0.9981647   0.9981647

In all these cases, the algorithm is able to detect novelty at the Phylum level with high probability.

With this in mind, we are ready to predict the full test set:

# Predict the whole test set
out <- predict(classifier, DNA = testDNA, rho = 0.1, verbose = T)
## Number of sequences predicted =  250 / 2500  
## Number of sequences predicted =  500 / 2500  
## Number of sequences predicted =  750 / 2500  
## Number of sequences predicted =  1000 / 2500  
## Number of sequences predicted =  1250 / 2500  
## Number of sequences predicted =  1500 / 2500  
## Number of sequences predicted =  1750 / 2500  
## Number of sequences predicted =  2000 / 2500  
## Number of sequences predicted =  2250 / 2500  
## Number of sequences predicted =  2500 / 2500
head(out)
##       Phylum  Class  Order Family Prob_Phylum Prob_Class Prob_Order Prob_Family
## ID_1  V*-1,1 V*-1,2 V*-1,3 V*-1,4           1  0.9999999  0.9999982   0.9816346
## ID_9  V*-1,1 V*-1,2 V*-1,3 V*-1,4           1  0.9999999  0.9999986   0.9756333
## ID_10 V*-1,1 V*-1,2 V*-1,3 V*-1,4           1  0.9999996  0.9999836   0.9379793
## ID_18 V*-1,1 V*-1,2 V*-1,3 V*-2,4           1  0.9999998  0.9999910   0.9125458
## ID_30 V*-1,1 V*-1,2 V*-1,3 V*-2,4           1  0.9999989  0.9999765   0.6045719
## ID_40 V*-1,1 V*-1,2 V*-1,3 V*-2,4           1  0.9999985  0.9999542   0.5777873

We evaluate the performance of the classifier in the next section.

Accuracies and calibration

When evaluating the performance of a taxonomic classifier, it is necessary to distinguish between the true test labels that are observed in training and those that are new. If the annotations of the test sequences are known, BayesANT provides a function, called add_novelty_to_test_data, that relabels the test observations whose labels do not appear in the training set. For a given query sequence, the relabelling works as follows:

  • If all the true taxa are observed in training, do nothing.
  • If there is a new taxon unobserved at some level, its label is substituted with 'name_of_last_observed_taxon' + 'name_of_level_where_novelty_appears' + 'new'. For example, if a new Family is detected within the Order Lepidoptera, its name will be Lepidoptera_Family_new for the Family level up until the last in the taxonomy. The name of a new taxon at level one is, in this case, Phylum_new. A standalone sketch of this rule is shown after this list.
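
For intuition, the relabelling rule can be mimicked with a small standalone function. This is purely illustrative (the taxon names are hypothetical); in practice you should simply call add_novelty_to_test_data as shown below.

## Relabel one test annotation given the taxa observed at each level in training
## (illustrative sketch, not the package's internal code)
relabel_novelty <- function(test_taxa, train_taxa, rank_names) {
  for (l in seq_along(test_taxa)) {
    if (!test_taxa[l] %in% train_taxa[[l]]) {
      parent <- if (l == 1) "" else paste0(test_taxa[l - 1], "_")
      test_taxa[l:length(test_taxa)] <- paste0(parent, rank_names[l], "_new")
      break
    }
  }
  test_taxa
}
train_taxa <- list(Phylum = "Arthropoda", Class = "Insecta",
                   Order = "Lepidoptera", Family = c("Erebidae", "Pieridae"))
## A query whose true Family was never observed in training
relabel_novelty(c("Arthropoda", "Insecta", "Lepidoptera", "Noctuidae"),
                train_taxa, c("Phylum", "Class", "Order", "Family"))

For this query, the last entry becomes Lepidoptera_Family_new, while the higher ranks are left untouched.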

Identifying new leaves is useful to evaluate the accuracy of the classifier. To do this, run the following commands:

# Load the test data
test_data <- read.BayesANT.data(file.test, rank_names = rank_names)
# Include novelty in the test data for correct labelling
test_data <- add_novelty_to_test_data(data = data, test_data = test_data)
# Which ones are new?
id_new <- grepl("_new$", test_data$Family)
# Print the classification accuracy
plot_accuracies(prediction = out, test_data = test_data)

The resulting plot depicts how well calibrated the probabilities are. Values close to the diagonal line indicate good calibration. If the algorithm shows miscalibration, consider adjusting the parameter rho when predicting. This can be done visually by plotting the output accuracies, or more formally via cross-validation.
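
If the test labels are known, a simple way to assess the effect of rho is to rerun the prediction for a few candidate values and compare the accuracy against the average predicted probability at the lowest rank. The sketch below assumes the rows of the prediction output and of test_data are in the same order; a proper cross-validation would instead split the training library.

## Compare accuracy and average confidence at the Family level for several rho values
rhos <- c(0.05, 0.1, 0.5, 1)
calibration <- t(sapply(rhos, function(r) {
  pred <- predict(classifier, DNA = testDNA, rho = r)
  c(rho = r,
    accuracy = mean(pred$Family == test_data$Family),
    avg_prob = mean(pred$Prob_Family))
}))
calibration

Values of rho for which accuracy and avg_prob are closest correspond to the best-calibrated predictions at that rank.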

Acknowledgements

The development of this package and the underlying research have been funded by the LIFEPLAN project (European Union's Horizon 2020 research and innovation programme, grant agreement No 856506).

References

De Blasi, P., S. Favaro, A. Lijoi, R. H. Mena, I. Prünster, and M. Ruggiero. 2015. “Are Gibbs-Type Priors the Most Natural Generalization of the Dirichlet Process?” IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2): 212–29.
Perman, M., J. Pitman, and M. Yor. 1992. “Size-Biased Sampling of Poisson Point Processes and Excursions.” Probability Theory and Related Fields 92 (1): 21–39.
Zito, A., T. Rigon, and D. B. Dunson. 2022. “Inferring Taxonomic Placement from DNA Barcoding Allowing Discovery of New Taxa.” arXiv:2201.09782.