classify.otu output question

Hello everyone,

I had a couple of questions about the output files from classify.otu. For example (taken from the MiSeq SOP):

OTU007 6 Bacteria(100);“Proteobacteria”(100);Betaproteobacteria(100);Neisseriales(100);Neisseriaceae(100);Neisseria(100)

I know this is saying that OTU007 was observed 6 times in my sample and 100% of the sequences were being classified as Neisseria. But if I had an entry such as

Otu00100 68 Bacteria(100) Proteobacteria(100) Alphaproteobacteria(100) Rhizobiales(100) Bradyrhizobiaceae(100) Bradyrhizobium(96)

Is this saying that Otu00100 was observed 68 times but 96% of the sequences fit that OTU? Or does this mean that of those 68 sequences they collectively match 96% of the time or is it something entirely different that I’m not seeing?

Additionally, some of the OTU output I see looks like this (see below), where there is no number after “unclassified”.

Otu00707 2 Bacteria(100) Bacteroidetes(100) Sphingobacteria(100) Sphingobacteriales(100) Sphingobacteriaceae(100) unclassified

What does this mean?

Thank you for all your help!

  • Jake

Hi Jake

And please, someone correct me if I am wrong here.

The classify.seqs command uses by default the Wang method, implemented in RDP Classifier as well. From the wiki:

When finding the taxonomy of a given query sequence in the fasta file, the wang method looks at the query sequence kmer by kmer. The method looks at all taxonomies represented in the template, and calculates the probability a sequence from a given taxonomy would contain a specific kmer. Then calculates the probability a query sequence would be in a given taxonomy based on the kmers it contains, and assign the query sequence to the taxonomy with the highest probability. This method also runs a bootstrapping algorithm to find the confidence limit of the assignment by randomly choosing with replacement 1/8 of the kmers in the query and then finding the taxonomy. This is the method that is implemented by the RDP and is described by Wang et al.
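To make the quoted description a bit more concrete, here is a minimal Python sketch of the idea. It is not mothur's or the RDP Classifier's actual code: the k-mer size of 8, the simple 0.5 smoothing, the 100 bootstrap iterations, and the function names `kmers`, `train`, `best_taxon` and `classify` are all assumptions made for illustration, and the two reference sequences are made-up toy data.

```python
import math
import random
from collections import Counter, defaultdict

K = 8  # assumed k-mer size for this sketch


def kmers(seq, k=K):
    """Return the overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=K):
    """references: list of (taxon, sequence) pairs.

    Returns, per taxon, how many reference sequences contain each k-mer,
    and how many reference sequences the taxon has.
    """
    kmer_counts = defaultdict(Counter)
    n_seqs = Counter()
    for taxon, seq in references:
        n_seqs[taxon] += 1
        kmer_counts[taxon].update(set(kmers(seq, k)))
    return kmer_counts, n_seqs


def best_taxon(words, kmer_counts, n_seqs):
    """Pick the taxon with the highest summed log-probability for these k-mers."""
    def score(taxon):
        # Simplified smoothing (assumption): unseen k-mers get a small non-zero probability.
        return sum(
            math.log((kmer_counts[taxon][w] + 0.5) / (n_seqs[taxon] + 1))
            for w in words
        )
    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(n_seqs), key=score)


def classify(query, kmer_counts, n_seqs, iters=100, k=K, seed=0):
    """Return (taxon, bootstrap confidence in percent) for one query sequence."""
    words = kmers(query, k)
    if not words:
        raise ValueError("query is shorter than the k-mer size")
    assignment = best_taxon(words, kmer_counts, n_seqs)
    rng = random.Random(seed)
    sample_size = max(1, len(words) // 8)  # 1/8 of the k-mers, with replacement
    hits = 0
    for _ in range(iters):
        sample = rng.choices(words, k=sample_size)
        if best_taxon(sample, kmer_counts, n_seqs) == assignment:
            hits += 1
    return assignment, round(100 * hits / iters)


# Toy usage with made-up reference sequences (hypothetical data).
toy_refs = [
    ("Bradyrhizobium", "ACGTTGCAAGCTTAGGCTAACCGGTTAACGATCGATCGGATCC"),
    ("Neisseria", "TTTTGGGGCCCCAAAATTGGCCAATTGGCCAAGGTTCCAAGGTT"),
]
toy_counts, toy_sizes = train(toy_refs)
print(classify(toy_refs[0][1], toy_counts, toy_sizes))
```

The confidence here is simply the share of bootstrap replicates that return the same taxon as the full set of k-mers, which is the quantity a cutoff such as 80 would be compared against.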

So, when you say

I know this is saying that OTU007 was observed 6 times in my sample and 100% of the sequences were being classified as Neisseria.

It is not exactly like that; rather, the “unique” sequence representing OTU7 was classified as Neisseria with 100% confidence after bootstrapping.

In the other case, your OTU100 was classified as Bradyrhizobiaceae(100) Bradyrhizobium(96), so the classification is 100% confident at the family level and 96% (or something close in concept) at the genus level, given the bootstrapping parameters. It is more or less what happens with the nodes of a tree when the branching is supported by bootstrapping.

In your last case, Sphingobacteriaceae(100) unclassified means that your OTU was classified at the family level with 100% confidence according to the bootstrapping parameters, but that no classification reached the threshold (did you use 80 as the cutoff in classify.seqs?) at the genus level. This can happen when no sequence in the database is similar enough to the unique sequence for this OTU. This is not so strange if you work with, say, environmental samples, as many species and even genera are not yet described.

Hope this helps!

Susana

Is this saying that Otu00100 was observed 68 times but 96% of the sequences fit that OTU?

Yep

Additionally, some of the OTU output I see looks like this (see below), where there is no number after “unclassified”.

Otu00707 2 Bacteria(100) Bacteroidetes(100) Sphingobacteria(100) Sphingobacteriales(100) Sphingobacteriaceae(100) unclassified

We don’t calculate a consensus value for “unclassified” if they are all unclassified.
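As a rough illustration of the “percent of sequences in the OTU” reading, here is a small Python sketch of how a consensus lineage could be built from per-sequence classifications. It is not mothur's implementation: the function names `consensus` and `format_lineage`, the cutoff of 51, and the 65/3 split of the 68 sequences are assumptions chosen so that the example reproduces the Bradyrhizobium(96) line from the question.

```python
from collections import Counter


def consensus(taxonomies, cutoff=51):
    """taxonomies: one lineage per sequence in the OTU, each a list of taxon names.

    Returns a list of (taxon, percent) pairs, one per rank. The percent is the
    share of ALL sequences in the OTU that carry that taxon at that rank.
    Stops with ("unclassified", None) when nothing is classified at a rank or
    the most common taxon falls below the cutoff.
    """
    if not taxonomies:
        raise ValueError("an OTU needs at least one sequence")
    total = len(taxonomies)
    depth = max(len(t) for t in taxonomies)
    result = []
    current = taxonomies
    for rank in range(depth):
        names = Counter(
            t[rank] for t in current
            if len(t) > rank and t[rank] != "unclassified"
        )
        if not names:
            # Every remaining sequence is unclassified here: no number to report.
            result.append(("unclassified", None))
            break
        taxon, count = names.most_common(1)[0]
        percent = round(100 * count / total)
        if percent < cutoff:
            result.append(("unclassified", None))
            break
        result.append((taxon, percent))
        # Only sequences that agree so far contribute to deeper ranks.
        current = [t for t in current if len(t) > rank and t[rank] == taxon]
    return result


def format_lineage(cons):
    """Render the consensus in the Name(percent);Name(percent) style."""
    return ";".join(
        name if pct is None else f"{name}({pct})" for name, pct in cons
    )


# Hypothetical OTU of 68 sequences: 65 reach Bradyrhizobium, 3 stop at family.
family = ["Bacteria", "Proteobacteria", "Alphaproteobacteria",
          "Rhizobiales", "Bradyrhizobiaceae"]
otu = [family + ["Bradyrhizobium"]] * 65 + [family + ["unclassified"]] * 3
print(format_lineage(consensus(otu)))
```

With these made-up counts, 65 of 68 sequences agree at the genus level, which rounds to 96. An OTU whose sequences are all unclassified at the genus level ends in a bare “unclassified” with no number, as in the Otu00707 line.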

Hi,

I need help interpreting the classification of the data below. All of the OTUs are Pseudomonas at the genus level; however, they show 97, 96, 86, etc., so I am kind of lost as to how to interpret them. As Jake said, is it “96% of the sequences fit that OTU?” I have many OTUs representing Pseudomonas. Can someone help me interpret this? Also, is there any way to classify to the species level in mothur?

Otu0001 9275 Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);Pseudomonadales(100);Pseudomonadaceae(99);Pseudomonas(97);
Otu0002 819 Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);Pseudomonadales(100);Pseudomonadaceae(100);Pseudomonas(96);
Otu0003 61 Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);Pseudomonadales(92);Pseudomonadaceae(91);Pseudomonas(86);
Otu0004 43 Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);Pseudomonadales(100);Pseudomonadaceae(100);Pseudomonas(91);
Otu0005 20 Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);Pseudomonadales(100);Pseudomonadaceae(100);Pseudomonas(100);

thanks in advance…
rosh

These are 5 OTUs, where, I presume, the sequences within each are on average not more than 3% different from each other. If it helps, you can think of these 5 OTUs as different species. Of the sequences within each of these OTUs, 97, 96, 86, 91, and 100% of them were classified as members of the Pseudomonas. Assuming you used an 80% threshold in classify.seqs, you can be pretty confident that this is the situation.
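If it helps to pull these percentages out of the file for many OTUs at once, here is a small Python sketch that parses lines in the format pasted in this thread. The function name `parse_line` is hypothetical, and the parsing rules are assumptions based only on the lines shown here: three whitespace-separated columns (OTU label, size, lineage), with the lineage split on semicolons when present and on whitespace otherwise. Taxon names that themselves contain spaces would only survive in the semicolon-separated form.

```python
import re

# A taxon field is a name optionally followed by "(number)".
FIELD = re.compile(r"^(?P<name>.*?)(?:\((?P<pct>\d+)\))?$")


def parse_line(line):
    """Parse one consensus-taxonomy line into (otu, size, [(taxon, percent), ...]).

    percent is None when the field has no number, e.g. a bare "unclassified".
    """
    otu, size, lineage = line.split(None, 2)
    lineage = lineage.strip().strip(";")
    fields = lineage.split(";") if ";" in lineage else lineage.split()
    ranks = []
    for field in fields:
        field = field.strip()
        if not field:
            continue
        match = FIELD.match(field)
        # Drop the straight or curly quotes some taxonomies put around names.
        name = match.group("name").strip('"“”')
        pct = match.group("pct")
        ranks.append((name, int(pct) if pct is not None else None))
    return otu, int(size), ranks


line = ("Otu0003 61 Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);"
        "Pseudomonadales(92);Pseudomonadaceae(91);Pseudomonas(86);")
otu, size, ranks = parse_line(line)
genus, pct = ranks[-1]
print(otu, size, genus, pct)
```

From there it is easy to list, say, every OTU whose deepest rank has a percentage below some value you care about.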

Pat

Hi Pat

Looking at Rosh’s question and your answer, I think I need help with something I cannot understand. Pseudomonas is a genus in which some of the species (those belonging to the same group, as defined e.g. by Mulet et al.) usually share around 99% identity between their 16S sequences (and still belong to different species). Say you have OTUs, like these 5, and you know the sequences within each OTU share at least about 97% identity. How is it possible that you end up with these 5 different OTUs rather than a single one including all sequences, and that within those OTUs (except the last one) a percentage of sequences (as high as 8% in the case of OTU3), despite sharing 97% identity with the rest of the sequences in the OTU, could not be classified even at the order level? Is that “biologically” possible, or does it point to some “error” in the sequences, perhaps due to inadequate filtering and/or chimera removal?

I have sometimes had cases like this one and wondered how that was possible, so I took the chance here to ask what you think about it :)

Susana

The Pseudomonas is a notorious taxonomic garbage can that is very diverse. It’s not at all surprising that there is broad variation in its members. Also, you seem to be conflating the distance threshold with taxonomic levels. This is dangerous and not warranted. It’s likely that the most diverse Pseudomonas sequences differ by 15%.

Pat

Hi Pat, may I quote roshanbernard’s question here?

I’m facing the same problem with paired-end Illumina datasets, and a species-level identification would be appreciated. Is there a particular alignment file or taxonomy file to use within mothur to achieve species-level classification?

Thank you for your kindness!

Valerio

Some of the lineages (~10%) in the greengenes taxonomy go to the species level. I still think that if you want “species” then you should do OTUs.

Pat