classify.seqs output: inconsistent taxon name usage

Dear Mothur users and developers,
Using the classify.seqs command I noticed that the taxonomic summary file seems to be inconsistent in taxon name usage.
In the file “mysamples.final.rdp6.tax.summary” one will find several taxa within a given taxonomic level that are called “unclassified”.
E.g. at phylum level (level 2) rank IDs 0.1.34.1 & 0.1.35.1 both have “unclassified” as taxon name. I would expect them to be named “unclassified……”.
I think that if this is a real bug, it will seriously affect downstream taxonomy-based diversity analysis.
Best wishes,
Guus

taxlevel rankID taxon daughterlevels total Sample1
0 0 Root 1 122284 20708
1 0.1 Bacteria 10 122284 20708
2 0.1.2 Actinobacteria 1 4409 1620
2 0.1.5 Bacteroidetes 4 3666 9
2 0.1.11 Cyanobacteria 1 805 788
2 0.1.13 Deinococcus-Thermus 1 1 0
2 0.1.15 Firmicutes 5 111887 17525
2 0.1.24 Proteobacteria 5 111 18
2 0.1.29 Tenericutes 1 18 0
2 0.1.32 Verrucomicrobia 1 56 0
2 0.1.34 unclassified 1 1329 748
2 0.1.35 unclassified_Bacteria 1 2 0
3 0.1.2.1 Actinobacteria 5 4409 1620
3 0.1.5.2 Bacteroidia 1 3621 6
3 0.1.5.3 Flavobacteria 1 2 1
3 0.1.5.4 Sphingobacteria 1 3 2
3 0.1.5.5 unclassified 1 40 0
3 0.1.11.1 Cyanobacteria 1 805 788
3 0.1.13.1 Deinococci 1 1 0
3 0.1.15.1 Bacilli 3 102789 17062
3 0.1.15.2 Clostridia 2 7995 398
3 0.1.15.3 Erysipelotrichi 1 306 13
3 0.1.15.5 unclassified 1 726 52
3 0.1.15.6 unclassified_Firmicutes 1 71 0
3 0.1.24.1 Alphaproteobacteria 4 10 4
3 0.1.24.2 Betaproteobacteria 2 20 2
3 0.1.24.3 Deltaproteobacteria 1 4 0
3 0.1.24.5 Gammaproteobacteria 2 64 8
3 0.1.24.6 unclassified 1 13 4
3 0.1.29.1 Mollicutes 1 18 0
3 0.1.32.5 Verrucomicrobiae 1 56 0
3 0.1.34.1 unclassified 1 1329 748
3 0.1.35.1 unclassified 1 2

It’s not a bug, it’s a feature :slight_smile: Don’t worry, this doesn’t affect downstream analyses. When you run the phylotype command on the companion taxonomy file, mothur keeps everything straight.

Hi Patrick,
Thanks for your reply.
Yes, now I can see why this doesn’t affect alpha and beta-diversity analyses.
However, I still don’t understand the advantage of this “feature”. It makes it hard to use the rdp6.tax.summary file to generate e.g. a pie chart with the taxonomic breakdown at a given tax level. You would end up with several sections called “unclassified”. It will be a pain to classify these manually when dealing with many very species-rich samples.
Why not call them “unclassified……” like the RDP Multiclassifier does?
Best,
Guus

So the problem is what to do with things that are assigned to TM7 and have an unclassified class, order, family, and genus. Our goal was to have the same number of sequences at each taxonomic level in the table. Also, it becomes difficult to put TM7;class_incertae_sedis;order_incertae_sedis;family_incertae_sedis;genus_incertae_sedis; because the rdp6 taxonomy is the only outline that actually corresponds to the Linnean taxonomy and we don’t want to make the exception the rule. Also, the period-separated rank ID in the second column already tells the “unclassified” entries apart: each one carries the rank IDs of its parent taxa.

Re: pie charts
http://www.juiceanalytics.com/writing/the-problem-with-pie-charts/
:smiley:

If you have ideas that we could apply regardless of the taxonomy outline we’d love to hear how to make the output more useful.
Thanks for the feedback…
Pat

Hi Pat,
Ok, forget that I ever mentioned a piechart :oops:
“[Piecharts] have no place in the world of grownups, and occupy the same semiotic space as short pants, a runny nose, and chocolate smeared on one’s face…”

I see the problem with TM7… the RDP Multiclassifier handles this pragmatically (non-Linnean) by putting the genus “TM7_incertae_sedis” as a member of the “generic taxonomic group” unclassified_TM7 within the phylum TM7.
Still, using the rank ID to trace back the “lower” taxonomic levels for each group of unclassifieds in a bargraph :wink: can be a lot of work and I am tempted to make a script to automate this.
Best wishes,
Guus
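
For anyone who wants to automate the rank ID lookup discussed above, here is a minimal Python sketch. It is not part of mothur. The function names (parse_summary, label_unclassified) are hypothetical, and it assumes the summary file is whitespace-delimited with taxlevel, rankID and taxon as the first three columns, as in the excerpt at the top of this thread. Each row named exactly “unclassified” is relabeled after its nearest named ancestor, found by trimming the rank ID one level at a time.

```python
def parse_summary(text):
    """Parse whitespace-delimited summary text into (taxlevel, rankID, taxon) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        rows.append((int(fields[0]), fields[1], fields[2]))
    return rows


def label_unclassified(rows):
    """Map each rankID to a label; bare 'unclassified' rows borrow an ancestor's name."""
    names = {rank_id: taxon for _, rank_id, taxon in rows}
    labels = {}
    for _, rank_id, taxon in rows:
        if taxon != "unclassified":
            labels[rank_id] = taxon
            continue
        label = "unclassified"
        parent = rank_id
        # Walk up the lineage: "0.1.34.1" -> "0.1.34" -> "0.1" -> "0"
        while "." in parent:
            parent = parent.rsplit(".", 1)[0]
            name = names.get(parent, "unclassified")
            if name != "unclassified":
                # Reuse an existing "unclassified_X" name instead of nesting prefixes
                if name.startswith("unclassified_"):
                    label = name
                else:
                    label = "unclassified_" + name
                break
        labels[rank_id] = label
    return labels


# A few rows copied from the table at the top of this thread
SAMPLE = """taxlevel rankID taxon daughterlevels total Sample1
1 0.1 Bacteria 10 122284 20708
2 0.1.15 Firmicutes 5 111887 17525
2 0.1.34 unclassified 1 1329 748
2 0.1.35 unclassified_Bacteria 1 2 0
3 0.1.15.5 unclassified 1 726 52
3 0.1.34.1 unclassified 1 1329 748
"""

labels = label_unclassified(parse_summary(SAMPLE))
for rank_id, label in labels.items():
    print(rank_id, label)
```

One caveat: a relabeled row can end up with the same name as an existing row. Here 0.1.34 becomes unclassified_Bacteria, the name 0.1.35 already has, so keep the rank ID next to the label if those groups need to stay separate in a graph.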

All kidding aside, you do make a good point and we’ll think about a way to deal with this…