I was poking around the world of taxonomy (eek, I’m happy thinking otu1, otu2 but some people insist on attaching names), the Tenericutes in particular. Someone had analyzed their pilot project in qiime (greengenes taxonomy) and was concerned when my analysis of their complete dataset didn’t have a predominance of Tenericutes like the qiime analysis did. I have a strong feeling that their qiime Tenericutes are mostly Erysipelotrichia, which are common in my analysis using the RDP trainset for classification; RDP just puts them in the Firmicutes. Which led me to looking through your taxonomy options, and I’m a bit confused as to what you recommend. The MiSeq SOP says to use the RDP trainset for classification, but in the Silva 119 readme you say that you suggest using the silva_nr for classification.
Can you give a brief (yet strongly opinionated) run down of which taxonomy to use when?
I think we can be friends :). Flip a coin?
As far as which is “best”, that really depends on the environment and the specific bugs in your dataset. Some will point to greengenes because it goes to the species level, but that is only true for about 10% of the sequences in the database. I would perhaps pick “best” by classifying against the three databases, seeing which does the best job of classifying to the deepest level, and then going with that. But that really only addresses which classifies deepest and not whether those assignments are “right”. Of the different methods, the Wang classifier regularly comes out on top, regardless of the database.
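To make the "which classifies deepest" comparison concrete, here is a minimal Python sketch. It assumes taxonomy strings are semicolon-separated ranks with an optional confidence in parentheses, that unresolved ranks contain the word "unclassified", and that greengenes-style rank prefixes such as `g__` may be present with nothing after them. The function names and the toy strings are hypothetical and only illustrate the idea; they are not real results.

```python
import re
from statistics import mean

# Trailing confidence score such as "(100)" or "(87.5)" (assumed format).
_CONFIDENCE = re.compile(r"\(\d+(?:\.\d+)?\)$")
# Greengenes-style rank prefix such as "p__" or "g__" (assumed format).
_RANK_PREFIX = re.compile(r"^[a-z]__")


def classified_depth(taxonomy):
    """Count the leading ranks that carry a real name.

    Counting stops at the first empty or 'unclassified' rank.
    """
    depth = 0
    for rank in taxonomy.strip().strip(";").split(";"):
        name = _CONFIDENCE.sub("", rank.strip())
        name = _RANK_PREFIX.sub("", name)
        if not name or "unclassified" in name.lower():
            break
        depth += 1
    return depth


def mean_depth(assignments):
    """Mean classified depth over a {sequence_name: taxonomy_string} dict."""
    return mean(classified_depth(t) for t in assignments.values())


if __name__ == "__main__":
    # Toy, made-up assignments for two hypothetical databases.
    toy = {
        "database_a": {"seq1": "Bacteria(100);Firmicutes(98);unclassified(98);"},
        "database_b": {"seq1": "k__Bacteria(100);p__Firmicutes(95);c__Bacilli(90);o__(90);"},
    }
    for database, assignments in toy.items():
        print(database, mean_depth(assignments))
```

As said above, a higher mean depth only tells you which database names things further down, not whether those names are right.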
Taxonomic names are a lot like points on Whose Line Is It Anyway. They’re a historical artifact of fights between taxonomic lumpers and splitters (e.g. Bacillus subtilis, thuringiensis, and cereus, or Bacteroides). You might try running classify.otu with the classifications from the three databases. That way you would at least know what differences you see.
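Once you have a consensus classification per OTU from each database, one way to see the differences is to line them up at a single rank. The Python sketch below assumes you have already loaded each result into a dict of OTU name to taxonomy string (semicolon-separated ranks, optional confidence in parentheses, optional greengenes-style `p__` prefixes). The function names and the toy strings are hypothetical; the toy data simply mirrors the kind of disagreement suspected in this thread and is not a real result.

```python
import re

_CONFIDENCE = re.compile(r"\(\d+(?:\.\d+)?\)$")  # e.g. "(100)"
_RANK_PREFIX = re.compile(r"^[a-z]__")           # e.g. "p__"


def name_at_rank(taxonomy, rank_index):
    """Return the cleaned name at a 0-based rank, or 'unclassified'."""
    ranks = taxonomy.strip().strip(";").split(";")
    if rank_index >= len(ranks):
        return "unclassified"
    name = _CONFIDENCE.sub("", ranks[rank_index].strip())
    name = _RANK_PREFIX.sub("", name)
    return name or "unclassified"


def disagreements(by_database, rank_index=1):
    """Return {otu: {database: name}} for OTUs named differently.

    by_database maps a database label to {otu: taxonomy_string}.
    Only OTUs present in every database are compared.
    rank_index=1 is the phylum when rank 0 is the domain.
    """
    shared = set.intersection(*(set(t) for t in by_database.values()))
    differing = {}
    for otu in sorted(shared):
        names = {
            database: name_at_rank(taxonomies[otu], rank_index)
            for database, taxonomies in by_database.items()
        }
        if len(set(names.values())) > 1:
            differing[otu] = names
    return differing


if __name__ == "__main__":
    # Made-up strings for illustration only.
    toy = {
        "rdp": {"Otu1": "Bacteria(100);Firmicutes(100);Erysipelotrichia(100);"},
        "gg": {"Otu1": "k__Bacteria(100);p__Tenericutes(100);c__Erysipelotrichi(100);"},
    }
    print(disagreements(toy))
```

Comparing names as plain strings will also flag pure spelling differences between databases, so the output is a list of things to look at, not a count of real conflicts.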
Hope this was strongly opinionated and brief enough!
Pat
Gotta say I was hoping for a stronger opinion than that.
I’m with you on the pointlessness of obsessing over taxonomy, but I’m now in a position of running other people’s data, and they all want to know “what species [singular] causes” whatever they are interested in. I’m fighting to get them to accept mothur rather than qiime (since I have to run it, I feel that I should get to choose the program), but maybe I’ll switch to gg taxonomy as the default so at least that part looks familiar to clients. I’d much rather fight over the righteousness of alignment-based clustering and repeated subsampling for alpha/beta diversity measures (not to mention a nicely written program rather than a random collection of scripts) than over which taxonomy is marginally less flawed.
Basically, greengenes classifies 5X as many actino, 4X as many proteo, 3 orders of magnitude more Cyano, and half as many Firmicutes as Silva or RDP. My thought on the Cyanobacteria is that maybe greengenes doesn’t have mitochondria/chloroplasts flagged? The Cyanobacteria are nearly all classified as genus Leptolyngbya. I’m probably going to stick with the RDP taxonomy for the moment; at least it and silva are fairly similar. (I like precision when I can’t evaluate accuracy.)