A tutorial of BayesANT
Bayesian nonparametric taxonomic classifier for DNA barcoding sequences
Introduction
Installation
# If the devtools R package is not already installed
# install.packages("devtools")
::install_github("alessandrozito/BayesANT")
devtoolslibrary(BayesANT)
Package description
BayesANT
is an R package that runs the Bayesian nonparametric taxonomic classifier described in Zito, Rigon, and Dunson (2022) The code is available at this github repository. In this tutorial we provide an overview of the functions in the package, and we briefly describe the algorithm. For explicit mathematical details and estimation method, see the Supplementary material in Zito, Rigon, and Dunson (2022).
The package is structured around two core methods:
- The method
BayesANT
, which trains the taxonomic classifier for a set of annotated DNA sequences. We endow it with asummary.BayesANT
method as well to summarise the main details of the model. - The method
predict.BayesANT
, which can be used to annotate test DNA sequences.
Additional functions included in the package are
read.BayesANT.data
, which loads a set of DNA sequences saved in a.fasta
file in the format accepted by theBayesANT
function.read.BayesANT.testDNA
, useful to load test DNA sequences for prediction.add_novelty_to_test_data
, which relabels the true annotations in the test set to account for novelty.plot_accuracies
, which displays the calibration plot for the predicted annotations if the true test labels are available.
In this tutorial, we show how to use BayesANT on a simulated reference library and a simulated set of test DNA sequences. Both files are accessible is .fasta
format via the following code.
<- system.file("extdata/trainDNA.fasta", package = "BayesANT")
file.train <- system.file("extdata/testDNA.fasta", package = "BayesANT") file.test
The algorithm
BayesANT is short for BAYESiAn Nonparametric Taxonomic classifier and it is an off-the-shelf algorithm that annotates test DNA sequences probabilistically. The algorithm needs to be trained on a set of annotated DNA sequences up to a given taxonomic rank
. When doing the classification, BayesANT accounts for both the taxa observed in training, and for the potential novel ones at every level of the taxonomy. It does so by adopting Pitman–Yor process priors (Perman, Pitman, and Yor 1992) over each node in the taxonomic tree. DNA sequences, instead, are modelled by means of a multinomial kernel with Dirichlet priors. We discuss the details in the following. Note: skip this section if you are solely interested in learning how to use the package.
The Pitman–Yor process
The Pitman–Yor process (Perman, Pitman, and Yor 1992) is a sequential allocation mechanism for species labels. In its urn scheme formulation, let \(V_1,\ldots, V_n\) be a sequence of taxon assignments for the DNA sequences in the training library, comprising of a total of \(K_n = k\) distinct labels denoted as \(V_1^*, \ldots, V_k^*\) and appearing with frequencies \(n_1, \ldots, n_k\). Then, the taxon of the \((n+1)\)st observation is determined as
\[ (V_{n+1} \mid V_1, \ldots, V_n) = \begin{cases} V_j^*, & \text{with prob.} \quad (n_j - \sigma)/(\alpha + n), \ j = 1, \ldots, k\\ \text{''new''}, & \text{with prob.} \quad (\alpha + \sigma k)/(\alpha + n), \end{cases} \] where \(\sigma \in [0,1)\) is a discount parameter governing the tail of the process and \(\alpha > -\sigma\) is a precision parameter. High values for \(\alpha\) and \(\sigma\) lead to a high number of distinct labels \(K_n\). Moreover, high values for \(n_j\) lead to a high probability that taxon \(V_j^*\) will be observed in the future. The figure below, borrowed from Zito, Rigon, and Dunson (2022), shows a practical illustration when \(\alpha = 1\) and \(\sigma= 0.25\).
Here, we aim at assigning a label to the 20th DNA sequence appearing in the data, conditional on the fact that we have observed 4 different taxa already. Parameters \(\alpha\) and \(\sigma\) are estimated from the data via maximization of the exchangeable partition probability function (De Blasi et al. 2015). See the supplement in Zito, Rigon, and Dunson (2022) for explicit details.
Prior probabilities
BayesANT extends the Pitman–Yor process described above to every level in the taxonomic tree. In particular, consider a taxonomic library \(\mathcal{D}_n = (\mathbf{V}_i,\mathbf{X}_i )_{i=1}^n\) of size \(n\) and of \(L\geq 2\) levels, where \(\mathbf{V}_i = (V_{i, \ell})_{\ell =1}^L\) are the taxonomic annotations for DNA sequence \(\mathbf{X}_i\). Following the notation in Zito, Rigon, and Dunson (2022), let \(V_{j, \ell}^*\) be the \(j\)th taxon and \(\mathbf{V}_{\cdot, \ell}^{(n)} = (V_{i, \ell})_{i =1}^n\) be the sequence of taxa observed for level \(\ell\).
The first step is to construct the taxonomic tree. For a generic taxon \(v_\ell\in \mathcal{V}_\ell\) at level \(\ell\), where \(\mathcal{V}_\ell\) is the space of taxa at \(\ell\), define \(\textrm{pa}(v_\ell)\) as the unique parent node of \(v_\ell\) at level \(\ell -1\) and \(\rho_n(v_\ell)\) as the set of nodes \(v_{\ell+1}\) at level \(\ell+1\) such that \(\textrm{pa}(v_{\ell+1}) = v_\ell\). We also let \(K_n(v_\ell) = |\rho_n(v_\ell)|\) be the number of nodes linked to \(v_\ell\) at level \(\ell + 1\) and \(N_n(v_\ell)\) be the size of the taxon, namely the number of DNA sequences linked to \(v_\ell\). The prior prediction scheme becomes \[ (V_{n+1, \ell} \mid V_{n+1, \ell-1} = v_{\ell-1}, \mathbf{V}_{\cdot, \ell}^{(n)}) = \begin{cases} V_{j, \ell}^*, &\text{with prob.} \quad \big(N_n(V_{j, \ell}^*) - \sigma_\ell \big)/\big(\alpha_\ell + N_n(v_{\ell-1})\big), \\ \text{''new''}, & \text{with prob.} \quad \big(\alpha_\ell + \sigma_\ell K_n(v_{\ell-1})\big)/\big(\alpha_\ell + N_n(v_{\ell-1})\big), \end{cases} \] for all \(j\) such that \(\textrm{pa}(V_{j, \ell}^*) = v_{\ell -1}\) and where \(\sigma_\ell \in [0,1)\) and \(\alpha_\ell > -\sigma_\ell\) are rank-specific parameters. Parameters \(\alpha_\ell,\sigma_\ell\) are level-specific and allow for borrowing of information across branches. The figure below shows an example of a taxonomic tree divided into 3 levels (Phylum, Class and Order) and for which there are a total of \(n=28\) DNA sequences divided into 2 taxa at the first level, 5 at the second and 8 at the third. Under the assumption that a new node at level \(\ell\) automatically creates a new node at all levels \(\ell + 1, \ldots, L\) below, the total number of new leaves in this taxonomy is 8: one for every node at level 1 (2), one for every node at level 2 (5) and one entirely new (the branch with the ant-shaped symbol).
The final prior probability for the \(n+1\)th DNA sequence is then product of Pitman–Yor probabilties, namely \[ \begin{split} \pi_{n+1}(v_L) &= \text{pr}(V_{n+1, L} = v_L \mid \mathcal{D}_n) \\ &= \text{pr}(V_{n+1, 1} = v_1 \mid \mathbf{V}_{\cdot, 1}^{(n)})\times \prod_{\ell=2}^{L} \text{pr}(V_{n+1, \ell}=v_{\ell}\mid V_{n+1, \ell-1}=v_{\ell-1}, \mathbf{V}_{\cdot, \ell}^{(n)}). \end{split} \]
Posterior probabilities
Once prior probabilities \(\pi_{n+1}(v_L)\) are determined, BayesANT specifies the posterior prediction probabilities by means of a multinomial kernel. In particular, we let \[ (\mathbf{X}_i \mid V_{i, L} = v_L, \boldsymbol{\theta}_{v_L}) \stackrel{\text{iid}}{\sim}\mathcal{K}(\mathbf{X}_i; \boldsymbol{\theta}_{v_L}), \] where \(\mathcal{K}(\cdot, \boldsymbol{\theta}_{v_L})\) is a distribution that depends on the leaf-specific vector of parameters \(\boldsymbol{\theta}_{v_L}\). Depending on the DNA sequences in training, BayesANT can take on two types of kernel
If the sequences are
aligned
and each is of length \(p\), we have a product-multinomial kernel, i.e. \[ \mathcal{K}(\mathbf{X}_i; \boldsymbol{\theta}_{v}) = \prod_{s=1}^p\prod_{g\in \mathcal{N}_1} \theta_{v, s,g}^{\mathbf{1}\{X_{i,s} = g\}}, \] where \(\theta_{v,s,g}\) is the probability that nucleotide \(g \in \mathcal{N}_1 = \{A,C,G,T\}\) is observed at location \(s = 1, \ldots, p\) for taxon \(v\). If we further assume a conjugate Dirichlet prior over each parameter, namely \[\boldsymbol{\theta}_{v,s} \sim Dirichlet(\xi_{v,s,\text{A}}, \xi_{v,s,\text{C}}, \xi_{v,s,\text{G}}, \xi_{v,s,\text{T}})\] with \(\boldsymbol{\xi}_{v,s} = (\xi_{v,s,\text{A}}, \xi_{v,s,\text{C}}, \xi_{v,s,\text{G}}, \xi_{v,s,\text{T}})^\top\) being a vector of hyperparameters, the posterior distribution becomes \[ p(\boldsymbol{\theta}_v \mid \mathcal{D}_n) \propto \prod_{s=1}^p \prod_{g\in \mathcal{N}_1} \theta_{v, s,g}^{\xi_{v,s,g} + n_{v,s,g}}. \] where \(n_{v,s,g} = \sum_{i: V_{i, L} = v} \mathbf{1}\{X_{i,s} = g\}\) indicates the number of times nucleotide \(g \in \mathcal{N}_1\) is recorded at locus \(s\) for the DNA sequences linked to leaf \(v\).If the sequences are
not aligned
, we specify a multinomial kernel over the \(\kappa\)-mer decomposition of the sequence, i.e. \[ \mathcal{K}(\mathbf{X}_i; \boldsymbol{\theta}_{v}) = \prod_{g=1}^{4^\kappa} \theta_{v, g}^{n_{i,g}} \] where \(n_{i,g}\) is the totol number of times \(\kappa\)-mer \(g\) appeares in the sequence \(\boldsymbol{X}_i\). Under a Dirichlet prior similar to the one above, the prosterior distribution is again a Dirichlet.
Irrespective of the kernel chosen, the prediction probability detailed by BayesANT are computed as
\[ p_{n+1}(v_L) \propto \pi_{n+1}(v_L) \int \mathcal{K}(\mathbf{X}_{n+1}; \boldsymbol{\theta}_{v_L}) p(\boldsymbol{\theta}_{v_L}\mid \mathcal{D}_n)\mathrm{d}\boldsymbol{\theta}_{v_L}, \] where \(p_{n+1}(v_L) = \text{pr}(V_{n+1, L} = v_L \mid \mathbf{X}_{n+1}, \mathcal{D}_n)\) iand \(v_L\) is a generic leaf in the taxonomic tree. If \(v_L\) is an unobserved leaf, we replace the posterior \(p(\boldsymbol{\theta}_{v_L}\mid \mathcal{D}_n)\) with a prior \(p(\boldsymbol{\theta}_{v_L})\) setting th nucleotide counts to 0. In the aligned case, for example the prediction rule is
\[ p_{n+1}(v_L) \propto \begin{cases} \pi_{n+1}(v_L) \prod_{s=1}^p (\xi_{v_L,s, g_s} + n_{v_L,s,g_{s}})/M_{v_L,s}, & \textrm{if } v_L \textrm{ is an observed leaf} \\ \pi_{n+1}(v_L) \prod_{s=1}^p \xi_{v_L,s, g_s}/\xi_{v,s,0} , & \textrm{if } v_L \textrm{ is a novel leaf} \end{cases} \] where \(M_{v,s} =\sum_{g \in \mathcal{N}_1} (\xi_{v,s,g} + n_{v,s,g})\) is a normalizing constant for the nucleotide probabilities and \(\xi_{v,s,0} =\sum_{g \in \mathcal{N}_1} \xi_{v,s,g}\) is the sum of the hyperparameters at location \(s\) for leaf \(v\). These are tuned via method of the moments automatically, as detailed in the supplementary material in Zito, Rigon, and Dunson (2022).
Prediction rule
Once all the posterior probabilities are defined for all leaves, BayesANT annotates the DNA sequence \(\mathbf{X}_{n+1}\) with the labels \((v_\ell^*)_{\ell=1}^L\) that satisfy \[\begin{equation}\label{eq:PredictionRule} v^*_\ell = {\arg\max}_{v_\ell \in \rho_n(v^*_{\ell-1})} \sum_{v_L \in \mathcal{L}_n(v_\ell)} p_{n+1}(v_L), \end{equation}\] where \(\mathcal{L}_n(v_\ell)\) is the set of all leaves linked to node \(v_\ell\) in a library of \(n\) DNA sequences. As we detail below, in the package we also allow the option to return not only the most likely taxon, but the leaves with the highest probabilities.
When specifying the prediction probabilities, BayesANT introduces an additional parameter, \(\rho \in (0,1]\), which can be tuned to recalibrate the output. In particular, In particular, we post-process the prediction probabilities \(p_{n+1}(v_L)\) as \[ \tilde{p}_{n+1}(v_L) = \frac{p_{n+1}(v_L)^\rho}{\sum_{v \in \mathcal{V}_L} p_{n+1}(v)^\rho}, \] which are later used in our prediction algorithm. Such a strategy does not alter the ranking of the original probabilities since the transformation is monotonic. Moreover, if \(p_{n+1}(v_L)=1\), then also \(\tilde{p}_{n+1}(v_L)=1\), implying that the overall accuracy is barely modified. Choices for \(\rho\) can be adopted via cross validation. In particular, prediction probabilities are calibrated if the average probability for the predicted nodes is equal to the classification accuracy. For example, if 90% of the sequences are correctly classified, ideally the average classification probability is approximately 0.9. An average value of 0.5 and of 0.99, instead, means that the algorithm is too conservative when right and too confident when wrong, respectively. The default value of \(\rho = 0.1\) works well in practice, but it is recommendable to use it carefully when classifying.
The BayesANT
module
The taxonomic classifier described above is all implemented in the package. In what follows, we detail how train the classifier with the BayesANT
module.
Step 1: load DNA sequences
As a first step, we need to load the DNA sequences into the environment. The package is endowed with a set of training DNA sequences already, stored in .fasta
file. We have chosen this format since it is the most used when working with DNA data. To see the file, run the following commands.
# Find the link to the training file
<- system.file("extdata/trainDNA.fasta", package = "BayesANT")
file.train # See how the data are stored
readLines(file.train, n = 8)
## [1] ">ID_2 Root;V*-1,1;V*-1,2;V*-1,3;V*-1,4"
## [2] "GACTTGTCGGGTGGCGCCTAGTCGCGGCCAAAAACACACAGGTCATACTGCTGATTTCAG"
## [3] "CCAAAATTACTCGCGCCTCTGCGAGCTCTTGGCAGATAAGGTCACGCCTTGTACAACCCT"
## [4] "TAGCTGTGACAGTATCTCAATGATTAAGACCGAGCGTAGCCGGTCCCGACCTCATGCAGC"
## [5] "AGAGGTTCACGGTAAATGTTCGTTATGGTTCCCGAGGGCTCCACTGTTAAGCGTAAGCAC"
## [6] "ACGGCTCGCTGCCCTCACAAACCAGGGTGAGCGGGCGGCGTATACATACTACACCGGTCT"
## [7] "CAGGCTTTAAACGGTGGATCATGTATTAGACCCCGCACGGAGCGGGTGTCTACCACAATC"
## [8] "TATAAATACCGTACCTATCCCCAGCTCTGCATGGTTCCCT"
The output above corresponds to the first DNA sequence in the file. The name of the sequence is ID\_2
, and its annotation is Root;V*-1,1;V*-1,2;V*-1,3;V*-1,4
and comprises of 4 taxonomic ranks. Here, Root
is the root of the tree, and V\*-1,1
is the node at the first taxonomic level. Names are given to mirro the notation above, since this is a simulated dataset. Notice that each annotation must start with the Root
, and names must be separated with a semicolon ;
. Note: to work with BayesANT, all training DNA sequences must follow this nomenclature. Failure in doing so results in an incorrect use of the package.
To facilitate the correct use of this data for our package, we have created the function read.BayesANT.data
, which returns an object of class c('data.frame', 'BayesANT.data')
.
# Load training data
<- 4
rank <- c("Phylum","Class", "Order", "Family")
rank_names <- read.BayesANT.data(fasta.file = file.train,
data rank = rank,
rank_names = rank_names)
# See some random observations
set.seed(10)
sample(nrow(data), size = 5), ] data[
## Phylum Class Order Family
## ID_662 V*-1,1 V*-2,2 V*-26,3 V*-128,4
## ID_4951 V*-3,1 V*-22,2 V*-237,3 V*-854,4
## ID_4518 V*-3,1 V*-21,2 V*-228,3 V*-838,4
## ID_5944 V*-3,1 V*-22,2 V*-237,3 V*-863,4
## ID_7443 V*-3,1 V*-23,2 V*-254,3 V*-981,4
## DNA
## ID_662 GACTTGTCGGGTGGCGCCTAGTCGCTGTCGAAATAACACAGGCCATACTCCGCTTGACGGGCACAATTTCAATCGACCCATGCAGCCTTGGACAGAGAAGGTCACGACCCGGACTACCGTCAGCAACGACGGAAACTCAGTGATGAAGGCCGATCGTAGCGGGTCCCGACTCCATGCTGCATAGCATCAAGGTAAAGGCTCATTATGGTTCCAGATGACTCCACTGTCAAGCGTTAGCACACTCCTCGCTGACGTCAATAAACAGGGGTAAAGGGCCGAGTATACATACTACACTTCTCTCTGGCTTTAAACTGTGGATCATGTCTTAGACCCCGCCCATAGCTCATGATTTCAAGAAATTATCATTACCGCAAGTCTCTACTGCTCAGCATGGTTCCCT
## ID_4951 GACTTGTCGGGTGGCGCCTAGTCGCGGTGAATATACCAGTTCCTCCGTCTCGCTGCCCAGCCATAATTTGAGAGTCCTTAGGCGGGCCGGGTGAGGCAAGGTAACGGCGCTGACAATTTGCAGCGACGGCACTATATTCGTGACTAAAACCTATAATTGATATCCACTGCTGTATCCTGCAGAGAATCGAGTGAACTATCAGGTGTGGCTCCCGAGGGGGCCGGCGACGAGGGTTAGCACAGCCGCAGCGTTAATTGAAAACCAGGATGAAAAGTACGCGGAGGCATTCAACCCCTGACTCTGCTTTTATGAGGTCGCTCGAGTGACCGAGCCCGGTGCCAGCTGCCAGCTTCCAGAAATTGTGATTACCGGCATTCGTTGCAGCTCATCATGCTTGACT
## ID_4518 GACTTGTCGGGTGGCGCCTAGTCGCGGCCGGAATCTAACTGCCCCTACTTCGCTTGCCTGCCAAACGTTCAATGGCCTCAGTCAGACCGTGGCAAAAAACGTCACGCCCGCGACAACCATCACTTGTGTGGCTCCCTCAGTGCTTAACACCTATCATTGCCGGTCCCTACTACGTATTACAGAGAATCACGTTACATGTTAAGTGTGGTACCCAAGGCCTCCGCTGTCGATCGTTAGAACCGCAATAGCTGTCATGCAAAACCTTACTGAAATCGCTGCGTATTCATACAACCCCAGACTCTGGTTTTGACTGGCCAACCGTGTGATAGAACCAGGCCTCAGCGGGTGACTTCTAGTAATTATCATTACCGTCACTAATTACAGCTCAGCATGGTTCCCT
## ID_5944 GACTTGTCGGGTGGCGCCTAGTCGCGGTCAAGATTCCACTTCCTCCGTGGAGCTCCCCAGCCATAATATCATTGTCCTAAGGCCGGGCGGGTGAGGCAACGTCACGGCGCTGACAACTCGCAGCGACGGCGCTATGTTAGTGACTAAAACGTATAATTGCTATCCCCTGCTGTATCCTTCAGAGAATCGCGAGAAATACCAGGTGTGGCTCCCGAGGGCGCCGGCCACGAGGGTTAGCACAGCCGTAGCGCTAATTGAAAACCAGGGTGAAATGTACGCGTATGCATTCAACCCCTAACTCTGCTTTTATGCGGTCGTTAGAGTGACCGAGCCCGATGCCAGCGGGTAACTTCGAGAATATGTGATTAACGGCATTCGTTGCCGCTCATCATGCTTCGCT
## ID_7443 GACTTGTCGGGTGGCGCCTAGTCGCGACCGAAGTCACACGGCGAATAGTGCGCTGGCCTGCCGAAGTTACAATGGCCAAAGGCAACCCGTGAGAGAGAAAGTCAGGCACCTGACCCTCGTCAGTGGTTACGGTATCGCTGTGATTAAGCCCTATCATTGCCGGTCCCGACTCTGGTTTACAGAGAATCCCGTTTCATGCAATTTATGGTTCACGAGGTCTTCGCTGTCGAGCGTTAGCACAGACGTCGCTGTCCTCCAAAACCCGTGAGCTATGGCCGCGAATACATAGAACCCCGGTCCCGCGATCCAACCGGTGGATCATATGACAGACCCCATTCACAGCGGGTGACTTCGAGAATTTAAGACTACCGTGCCTATTGACAACTCGGCCTGGGTCCCC
When loading the data, one can specify the rank
at which to load the sequences (default is the lowest observed), and the rank.names
(default are Level1
up to Level4
in this case). The last column is data$DNA
, and contains the DNA sequences in upper case and in string format. In this example, we have a set of \(7500\) aligned sequences of length \(p=400\), annotated across \(4\) taxonomic levels.
# Access the length of the DNA sequences stores in the last column
<- stringr::str_length(data$DNA)
length_DNA quantile(length_DNA)
## 0% 25% 50% 75% 100%
## 400 400 400 400 400
One useful tool to evaluate the sequences in the library is to look at the location-specific Gini index (a measure for heterogeneity). We plot it with the following functions:
## Function to compute the Gini index
<- function(freq){
Gini <- freq / sum(freq)
freq_rel 1 - sum(freq_rel^2)
}
## Apply the function to all the sequences in the training data
<- apply(stringr::str_split(string = data$DNA,
Gini_location pattern = "", simplify = T), 2,function(x) Gini(table(x)))
## Make the plot
plot(Gini_location)
Here, we see that the first 25 loci in the aligned DNA sequences are a conserved region, since they are all equal across the library. The rest, instead, is quite heterogeneous.
Step 2: train the classifier
Once the training library is correctly loaded, we are ready to train the classifier. We do this as follows:
## Train the classifier under the aligned kernel
<- BayesANT(data = data,
classifier typeseq = "aligned",
type_location = "single",
newtaxa = TRUE,
verbose = TRUE)
## Welcome to BayesANT 🐜 🐜 🐜 🐜
## No missing annotations detected.
## Estimating the Pitman-Yor parameters: Now.
## Estimating the Pitman-Yor parameters: Done.
## Computing prior probabilities: Now
## Computing prior probabilities: Done.
## Extracting nucleotide location counts from sequences: Now.
## Extracting nucleotide location counts from sequences: Done.
## Tuning hyperparameters: Now.
## Tuning hyperparameters: Done.
## Aggregating parameters: Now
## Aggregating parameters: Done
The function BayesANT
has the following arguments:
data
: an object loaded with the functionread.BayesANT.data
typeseq
: type of sequences loaded in the classifier. Options arealigned
andnot aligned
. This parameters determines which type of multinomial kernel to use. Selectingtypeseq = 'aligned'
requires sequences to be all of the same length.type_location
: what type ofaligned
kernel to use. Options aresingle
for the standard multinomial kernel over nucleotides ACGT, andpairs
for a multinomial kernel over the nucleotide pairs AA, AC, AG…kmers
: \(\kappa\)-mer length for thenot aligned
kernel. Recommendable only if sequences are truly not aligned.newtaxa
: logical. Whether to include new taxa in the training of the algorithm. IfFALSE
, it ignores the Pitman–Yor prior and classifies only by looking at the taxa observed in training.verbose
: ifTRUE
, monitor the progress of the training step.
In this tutorial, we use the case of the aligned
set of sequences since our train data are all of the same length. The not aligned version follows easily by setting typeseq = not aligned
and kmers = 6
(or any desired number).
To summarise the output of a classifier, run:
# Summary of the classifier
summary(classifier)
## BayesANT classifier 🐜🐜🐜🐜
##
## Number of DNA sequences in the library: 7500
## Type of sequences: aligned
## Average length: 400
## Number of nodes in the taxonomy: 2813
## Number of new nodes added to the taxonomy: 691
##
## Taxonomic tree and estimated parameters:
## Observed taxa New Taxa alpha sigma loglik
## ------- -------------- --------- ------ ------ ----------
## Phylum 17 1 2.00 0.00 -11252.82
## Class 66 18 0.95 0.00 -7947.93
## Order 504 84 2.88 0.18 -11814.21
## Family 1535 588 7.91 0.00 -11181.47
Here, we report the details of the taxonomic tree and the estimated level-specific prior parameters \(\alpha_\ell\) and \(\sigma_\ell\).
Prediction
Once the classifier has been trained, we are ready to annotate the test query sequences. The package is endowed with test sequences as well, generated from a similar process as the training ones. We load them with the function read.BayesANT.testDNA
as follows:
# Load the test sequences
<- system.file("extdata/testDNA.fasta", package = "BayesANT")
file.test <- read.BayesANT.testDNA(file.test)
testDNA 1] testDNA[
## ID_1
## "GACTTGTCGGGTGGCGCCTAGTCGCGGCCAAAAACACACAGGTCATACTGCTGCTTTCAGCCAAAATTACTCTCGCCTCTGCGAGCTCTTGGCAGATAAGGTCACGCCTTGTACAACGCTTAGCTGTGACAGTTTCTCAATGATTAAGACCGAGCGTAGCCGGTCCCGACCTCATGCAGCAGAGGTTCACGGTCAATGTTCGTTATGGTTCCCGAGGGCTCCACTGTTAAGCGTAAGCACACTCCTCGCTGCCCTCACAAACCAGGGTGAGCGGGCGGCGTATACATACTACACCGGTCTCAGGCTTTAAACGGTGGATCATGTATTTGACCACGCCCGTAGCGGGTGTCTACCACAATCTATAAATACCGTACCTATCCCCAGCTCTGCATGGTTCCCT"
This function loads the DNA sequences saved in .fasta
file as a named vector of strings. Then, to annotate the sequence we use the standard predict
method:
# Annotate a test sequence
<- predict(classifier,
annot DNA = testDNA[1],
rho = 0.1,
return_probs = TRUE,
n_top_taxa = 5,
cores = 1)
The function predict.BayesANT
has the following arguments:
object
: a object of classBayesANT
, i.e the output of a BayesANT classifierDNA
: a named vector of DNA strings, possibly loaded with the functionread.BayesANT.testDNA
rho
: temperature parameter to calibrate the probabilities. Default isrho=0.1
return_probs
: whether to return the highest leaf probabilitiesn_top_taxa
: number of leaves to return. Valid ifreturn_probs=TRUE
cores
: Number of cores to run the prediction in parallel, implemented via the packagesforeach
anddoParallel
. Default iscores=1
which is equivalent to a for loop.
When we specify return_probs=TRUE
, the package returns the following:
## Predicted annotation for testDNA[1]
$prediction annot
## Phylum Class Order Family Prob_Phylum Prob_Class Prob_Order Prob_Family
## ID_1 V*-1,1 V*-1,2 V*-1,3 V*-1,4 1 0.9999999 0.9999982 0.9816346
## First 5 leaves with the highest probability
$top_n_probs annot
## $ID_1
## Phylum Class Order Family leaf_prob
## 2 V*-1,1 V*-1,2 V*-1,3 V*-1,4 0.981634599
## 287 V*-1,1 V*-1,2 V*-1,3 V*-1,3_Family_new 0.005526119
## 6 V*-1,1 V*-1,2 V*-1,3 V*-5,4 0.005358812
## 7 V*-1,1 V*-1,2 V*-1,3 V*-6,4 0.005358812
## 4 V*-1,1 V*-1,2 V*-1,3 V*-3,4 0.001051397
Notice that using an aligned kernel requires that the training library and the test library should be aligned with the same algorithm and should have the same length.
Before predicting the whole test set. Let’s look at one additional example. To verify if BayesANT reasonably recognizes novelty, we create two mock DNA sequences with a totally different generating process from the ones observed in the data. If BayesANT works correctly, they should all identify a new branch at the first level.
## Extract the conserved region
set.seed(42)
<- substr(data$DNA[1],1, 25)
region <- rep(0, 3)
mockDNA ## Generate a mock query at random
1] <- paste(sample(size = 400, x = c("A", "C", "G", "T"), replace = T),
mockDNA[collapse="")
## Generate a mock query by repeating ACGT continuously
2] <- paste(rep("ACGT", 100), collapse = "")
mockDNA[## Generate a mock query by copying the conserved region and add one nucleotide
3] <- paste0(region, paste(rep("C", 375), collapse = ""))
mockDNA[## Generate a mock query by copying the conserved region and applying randomness
4] <- paste0(region, paste(sample(size = 375, x = c("A", "C", "G", "T"),
mockDNA[replace = T), collapse=""))
## Assign names
names(mockDNA) <- c("random", "ACGT", "region_C", "region_Random")
## Annotate it
predict(classifier, mockDNA)
## Phylum Class Order Family Prob_Phylum
## random Phylum_new Phylum_new Phylum_new Phylum_new 0.9993315
## ACGT Phylum_new Phylum_new Phylum_new Phylum_new 0.9999718
## region_C Phylum_new Phylum_new Phylum_new Phylum_new 0.9999228
## region_Random Phylum_new Phylum_new Phylum_new Phylum_new 0.9981647
## Prob_Class Prob_Order Prob_Family
## random 0.9993315 0.9993315 0.9993315
## ACGT 0.9999718 0.9999718 0.9999718
## region_C 0.9999228 0.9999228 0.9999228
## region_Random 0.9981647 0.9981647 0.9981647
In all these cases, the algorithm is still able to predict the Phylum novelty with high probability.
With this in mind, we are ready to predict the full test set:
# Predict the whole test set
<- predict(classifier, DNA = testDNA, rho = 0.1, verbose = T) out
## Number of sequences predicted = 250 / 2500
## Number of sequences predicted = 500 / 2500
## Number of sequences predicted = 750 / 2500
## Number of sequences predicted = 1000 / 2500
## Number of sequences predicted = 1250 / 2500
## Number of sequences predicted = 1500 / 2500
## Number of sequences predicted = 1750 / 2500
## Number of sequences predicted = 2000 / 2500
## Number of sequences predicted = 2250 / 2500
## Number of sequences predicted = 2500 / 2500
head(out)
## Phylum Class Order Family Prob_Phylum Prob_Class Prob_Order Prob_Family
## ID_1 V*-1,1 V*-1,2 V*-1,3 V*-1,4 1 0.9999999 0.9999982 0.9816346
## ID_9 V*-1,1 V*-1,2 V*-1,3 V*-1,4 1 0.9999999 0.9999986 0.9756333
## ID_10 V*-1,1 V*-1,2 V*-1,3 V*-1,4 1 0.9999996 0.9999836 0.9379793
## ID_18 V*-1,1 V*-1,2 V*-1,3 V*-2,4 1 0.9999998 0.9999910 0.9125458
## ID_30 V*-1,1 V*-1,2 V*-1,3 V*-2,4 1 0.9999989 0.9999765 0.6045719
## ID_40 V*-1,1 V*-1,2 V*-1,3 V*-2,4 1 0.9999985 0.9999542 0.5777873
We evaluate the performance of the classifier in the next section.
Accuracies and calibration
When evaluating the performance of a taxonomic classifier, it is necessary to distinguish between which true test labels are observed in training, and which ones are new. If the annotations of the test sequences are known, BayesANT provides a function, called add_novelty_to_test_data
, that relabels observations if their labels do not appear in the test set. For a given query sequence, the relabeling works as follows:
- If all true taxa are all observed in training, do nothing
- If the there is a new taxon unobserved at some level, this is substituted with
'name_of_last_observed_taxon' + 'name_of_level_where_novelty_appears' + 'new'
. For example, if a new Family is detected within the Order of Lepidoptera, its name will beLepidoptera_Family_new
for the Family level up until the last in the taxonomy. The name of a new taxon at level one in this case isPhylum_new
.
Identifying new leaves is useful to evaluate the accuracy of the classifier. To do this, run the following commands:
# Load the test data
<- read.BayesANT.data(file.test, rank_names = rank_names)
test_data # Include novelty in the test data for correct labelling
<- add_novelty_to_test_data(data = data, test_data = test_data)
test_data # Which ones are new?
<- grepl("_new$", test_data$Level4)
id_new # Print the classification accuracy
plot_accuracies(prediction = out, test_data = test_data)
The resulting plot depicts how calibrated the probabilities are. Values close to the diagonal line indicate perfect calibration. If the algorithm shows miscalibration, consider adjusting the parameter rho
when predicting. This can be done visually by plotting the output accuracies, or more formally via cross validation.
Acknowledgements
The build up of this package and the research have been funded by the LIFEPLAN project (European Union’s Horizon 2020 research and innovation program - grant agreement No 856506).
References
Duke University, alessandro.zito@duke.edu↩︎