Taxonomy in classify seqs

Hello,

Perhaps this is a bad question but I’m trying to get my head around how mothur assigns taxonomy at different levels.

Basically, I’m wondering whether mothur “knows” what each of the taxonomy levels in the taxonomy file is. Or is it entirely probability-based on k-mers, so that we could give just the family and genus, or just the genus and species, in the taxonomy file and it would still be able to do the assignments?

I ask because I’m trying to assign taxonomy to sequences, but for reasons I won’t get into here I need to use a different taxonomy. I have the genus and species for everything in my database, and mothur is working with my database/taxonomy file with no problem.

I’m just not sure if what I’ve done makes sense…

Can anyone offer some insight?
Thanks

I’m not sure I really follow… The assignments are based on kmers that it generates from the taxonomy and fasta file that you give it. It’s the exact algorithm used in the Wang et al. AEM “Naive Bayesian…” paper.
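To make that concrete, here is a minimal Python sketch of the word-based naive Bayesian scoring described in the Wang et al. paper. It is not mothur's actual code: the function names, the toy sequences, and the k-mer size of 3 are all assumptions for illustration (mothur's ksize parameter defaults to 8). Note that the label attached to each reference sequence is treated as an opaque string and is never parsed for rank names.

```python
import math
from collections import defaultdict

K = 3  # tiny k for the toy data; an assumption, not mothur's setting


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(reference, taxonomy):
    """reference: {seq_id: sequence}; taxonomy: {seq_id: label string}."""
    n_total = len(reference)
    seqs_with_word = defaultdict(int)                   # n(w) over all refs
    per_label = defaultdict(lambda: defaultdict(int))   # m(w) within a label
    label_size = defaultdict(int)                       # M, refs per label
    for seq_id, seq in reference.items():
        label = taxonomy[seq_id]
        label_size[label] += 1
        for w in kmers(seq):
            seqs_with_word[w] += 1
            per_label[label][w] += 1
    return n_total, seqs_with_word, per_label, label_size


def classify(query, model):
    """Return the label with the highest log joint k-mer probability."""
    n_total, seqs_with_word, per_label, label_size = model
    best_label, best_score = None, -math.inf
    for label, m_total in label_size.items():
        score = 0.0
        for w in kmers(query):
            # Word prior and conditional probability as in Wang et al.
            prior = (seqs_with_word.get(w, 0) + 0.5) / (n_total + 1)
            cond = (per_label[label].get(w, 0) + prior) / (m_total + 1)
            score += math.log(cond)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical reference set whose labels hold only genus and species.
reference = {
    "r1": "ACGTACGTACGT", "r2": "ACGTACGAACGT",
    "r3": "TTGGCCAATTGG", "r4": "TTGGCCTATTGG",
}
taxonomy = {
    "r1": "GenusA;species_a;", "r2": "GenusA;species_a;",
    "r3": "GenusB;species_b;", "r4": "GenusB;species_b;",
}
model = train(reference, taxonomy)
print(classify("ACGTACGTACGA", model))
```

Swapping the labels for full phylum-to-genus lineages would not change the scoring at all, since only the grouping of reference sequences matters.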

Okay, maybe I’m asking a bad question; I’m not an expert at this… but I’m going off this passage from the classify seqs web page:

“The method looks at all taxonomies represented in the template, and calculates the probability a sequence from a given taxonomy would contain a specific kmer. Then calculates the probability a query sequence would be in a given taxonomy based on the kmers it contains, and assign the query sequence to the taxonomy with the highest probability”

So this would make me think that the labels in the taxonomy don’t need to start at phylum, as long as they are consistent. Instead of starting at phylum, they could start at family, and as long as that was the case for every sequence in the database, then classifySeqs should work just fine?

Like I was saying… mothur doesn’t “know” what each taxonomy level is… it doesn’t know that phylum is first and class is second; it just sees a label and calculates the probability that a sequence matches that label based on the kmers it contains. Thus I can make a taxonomy/reference file where every sequence only has genus and species, and mothur should classify these as well as if it had the full taxonomy for them…

Am I making any sense? :shock:

So this would make me think that the labels in the taxonomy don’t need to start at phylum, as long as they are consistent. Instead of starting at phylum, they could start at family, and as long as that was the case for every sequence in the database, then classifySeqs should work just fine?

That’s correct. As we use it, a sequence is really only classified to the bottom level (like the genus). Then it does bootstrapping to figure out what percentage of the bootstraps have that classification, as well as the classification of the levels above it. So if you just gave it family through genus, then it would backfill the probabilities to family instead of to phylum or kingdom.
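A rough Python sketch of that backfilling step, assuming the bootstrap replicates have already been classified (in the Wang et al. method each replicate scores a random subset of the query's kmers). The function names, the replicate lineages, and the cutoff of 80 are hypothetical and only illustrate the bookkeeping; this is not mothur's implementation.

```python
def level_confidence(assigned, bootstrap_results):
    """Percent of bootstrap replicates agreeing with `assigned` at each level.

    Lineages are ';'-separated label strings, e.g. "FamilyX;GenusA;".
    Agreement at a level requires agreement at every level above it too.
    """
    target = [t for t in assigned.split(";") if t]
    replicates = [[t for t in r.split(";") if t] for r in bootstrap_results]
    confidences = []
    for depth in range(1, len(target) + 1):
        agree = sum(1 for r in replicates if r[:depth] == target[:depth])
        pct = round(100 * agree / len(replicates))
        confidences.append((target[depth - 1], pct))
    return confidences


def apply_cutoff(confidences, cutoff=80):
    """Keep levels from the top down until one falls below the cutoff."""
    kept = []
    for name, pct in confidences:
        if pct < cutoff:
            break
        kept.append((name, pct))
    return kept


# Ten made-up replicates for a reference that starts at family.
replicates = (
    ["FamilyX;GenusA;"] * 6
    + ["FamilyX;GenusB;"] * 3
    + ["FamilyY;GenusC;"] * 1
)
conf = level_confidence("FamilyX;GenusA;", replicates)
print(conf)                # [('FamilyX', 90), ('GenusA', 60)]
print(apply_cutoff(conf))  # [('FamilyX', 90)]
```

The top level here is the family simply because that is where the labels start; with a phylum-to-genus reference the same loop would report confidences all the way up to phylum.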

Thus I can make a taxonomy/reference file where every sequence only has genus and species, and mothur should classify these as well as if it had the full taxonomy for them…

Right - that should work. The only place I could see it falling down is that many genera will have similar kmers. This is why we don’t get a high percentage of reads that go to the genus level when we classify them. I suspect you would wind up with a lot of ‘unknowns’ where it can’t classify a sequence to any genus. If you can, it would probably be better to supplement the current taxonomies with the species label than to just use genus and species.
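One way to do that supplementing, sketched in Python under the assumption that the taxonomy file uses the usual mothur layout (sequence name, a tab, then a semicolon-terminated lineage). The helper name add_species and the example line are hypothetical, and replacing spaces with underscores is my own cautious choice for keeping each label a single token.

```python
def add_species(tax_line, species):
    """Append a species label to one line of a mothur-style taxonomy file.

    Assumes: sequence name, a tab, then a lineage whose levels are
    separated and terminated by semicolons.
    """
    name, lineage = tax_line.rstrip("\n").split("\t")
    if not lineage.endswith(";"):
        lineage += ";"
    label = species.strip().replace(" ", "_")  # avoid spaces inside a label
    return f"{name}\t{lineage}{label};"


# Hypothetical entry from an existing genus-level reference taxonomy.
line = "seq1\tBacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;"
print(add_species(line, "Bacillus subtilis"))
```

Every sequence in the reference would need the extra level so that the lineages stay consistent in depth.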