classifying seqs, rdp vs silva

I was poking around the world of taxonomy (eek, I’m happy thinking otu1, otu2 but some people insist on attaching names), the Tenericutes in particular. Someone had analyzed their pilot project in qiime (greengenes taxonomy) and was concerned when my analysis of their complete dataset didn’t show a predominance of Tenericutes like the qiime analysis did. I have a strong feeling that their qiime Tenericutes are mostly Erysipelotrichia, which are common in my analysis using the RDP trainset for classification; RDP just puts them in the Firmicutes. Which led me to look through your taxonomy options, and I’m a bit confused as to what you recommend. The MiSeq SOP says to use the RDP trainset for classification, but in the Silva 119 readme you say that you suggest using silva_nr for classification.

Can you give a brief (yet strongly opinionated) run down of which taxonomy to use when?

I think we can be friends :). Flip a coin?

As far as which is “best”, that really depends on the environment and the specific bugs in your dataset. Some will point to greengenes because it goes to the species level, but that is only true for about 10% of the sequences in the database. I would perhaps pick “best” by classifying against the three databases, seeing which does the best job of classifying to the deepest level, and then going with that. But that really only addresses which classifies deepest and not whether those assignments are “right”. Of the different methods, the Wang classifier regularly comes out on top, regardless of the database.
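The "which classifies deepest" comparison can be sketched roughly as below. This is only an illustration, not a mothur feature: the function names are hypothetical, and it assumes each classification is a semicolon-separated taxonomy string with an optional confidence in parentheses, where unresolved ranks are written as "unclassified", "Something_unclassified", or "unknown". The example strings are made up for illustration and are not taken from any real run.

```python
import re

def classified_depth(taxonomy):
    """Count the leading ranks that carry a real name (hypothetical helper)."""
    depth = 0
    for rank in taxonomy.strip().strip(";").split(";"):
        # Drop a trailing confidence score such as "(100)" or "(87.5)".
        name = re.sub(r"\([\d.]+\)$", "", rank.strip()).lower()
        # Stop at the first rank that is empty or flagged as unresolved.
        if not name or name == "unknown" or name.endswith("unclassified"):
            break
        depth += 1
    return depth

def mean_depth(taxonomies):
    """Average classified depth over a collection of taxonomy strings."""
    depths = [classified_depth(t) for t in taxonomies]
    return sum(depths) / len(depths) if depths else 0.0

# Made-up example strings, one list per database (illustrative only).
example = {
    "database_a": [
        "Bacteria(100);Firmicutes(100);Clostridia(95);unclassified(95);",
        "Bacteria(100);Firmicutes(100);Firmicutes_unclassified(100);",
    ],
    "database_b": [
        "Bacteria(100);Firmicutes(100);Clostridia(99);Clostridiales(98);",
        "Bacteria(100);Firmicutes(100);Bacilli(90);Bacilli_unclassified(90);",
    ],
}

for name, taxonomies in example.items():
    print(name, mean_depth(taxonomies))
```

As noted above, a higher average depth only says a database names more ranks, not that the names are correct.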

Taxonomic names are a lot like points on Whose Line Is It Anyway. They’re a historical artifact of fights between taxonomic lumpers and splitters (e.g. Bacillus subtilis, thuringiensis, and cereus, or Bacteroides). You might try running classify.otu with the classifications from each of the three databases. That way you would at least know what differences you are seeing.

Hope this was strongly opinionated and brief enough!

Flip a coin?

gotta say I was hoping for stronger opinion than that :wink:

I’m with you on the pointlessness of obsessing over taxonomy, but I’m now in a position of running other people’s data and they all want to know “what species [singular] causes” whatever they are interested in. I’m fighting to get them to accept mothur rather than qiime (since I have to run it, I feel that I should get to choose the program) but maybe I’ll switch to gg taxonomy as the default so at least that part looks familiar to clients. I’d much rather fight over the righteousness of alignment-based clustering and repeated subsampling for alpha/beta diversity measures (not to mention a nicely written program rather than a random collection of scripts) than over which taxonomy is marginally less flawed.

Hello black hole of taxonomy. I classified the same sequences against RDP, greengenes, and silva nr_119. Greengenes is wildly different:

taxon Greengenes Silva nr_119 RDP
Acidobacteria 102 190 197
Actinobacteria 538183 113799 113784
Bacteroidetes 437580 444947 444984
Cyanobacteria 464320 410
Firmicutes 1092447 2205982 2227724
Proteobacteria 263012 7133 7147
Tenericutes 2702 5019 77
unclassified 19406 4302
Verrucomicrobia 470139 470003 470187

Basically greengenes classifies about 5X as many actino, about 37X as many proteo, 3 orders of magnitude more Cyano and half as many Firmicutes as Silva or RDP. My thought on the Cyanobacteria is that maybe greengenes doesn’t have mitochondria/chloroplasts flagged? The Cyanobacteria are nearly all classified as genus Leptolyngbya. I’m probably going to stick with the RDP taxonomy for the moment; at least it and silva are fairly similar. (I like precision when I can’t evaluate accuracy.)
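For anyone who wants to check the fold differences, here is a small sketch that recomputes them from the table. It uses only the rows that list a count for all three databases; the Cyanobacteria and unclassified rows are left out because each is missing a value. The dictionary keys and the function name are just labels chosen for this example.

```python
# Phylum-level counts copied from the table rows that have all three values.
counts = {
    "Acidobacteria":   {"greengenes": 102,     "silva": 190,     "rdp": 197},
    "Actinobacteria":  {"greengenes": 538183,  "silva": 113799,  "rdp": 113784},
    "Bacteroidetes":   {"greengenes": 437580,  "silva": 444947,  "rdp": 444984},
    "Firmicutes":      {"greengenes": 1092447, "silva": 2205982, "rdp": 2227724},
    "Proteobacteria":  {"greengenes": 263012,  "silva": 7133,    "rdp": 7147},
    "Tenericutes":     {"greengenes": 2702,    "silva": 5019,    "rdp": 77},
    "Verrucomicrobia": {"greengenes": 470139,  "silva": 470003,  "rdp": 470187},
}

def fold_vs(reference, other, table):
    """Ratio of `other` counts to `reference` counts for every taxon."""
    return {taxon: row[other] / row[reference] for taxon, row in table.items()}

gg_vs_silva = fold_vs("silva", "greengenes", counts)
rdp_vs_silva = fold_vs("silva", "rdp", counts)

# Print the ratios side by side, largest greengenes/silva difference first.
for taxon in sorted(gg_vs_silva, key=gg_vs_silva.get, reverse=True):
    print(f"{taxon:16s} gg/silva={gg_vs_silva[taxon]:8.2f}  "
          f"rdp/silva={rdp_vs_silva[taxon]:8.2f}")
```

Ratios near 1 mean the two databases agree at the phylum level for that taxon; anything far from 1 is worth a closer look at the underlying sequences.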