What is the difference between unclassified and c_, o_, f_ e

fortinn · July 18, 2014, 7:32pm

Hello,

We have been using the Greengenes database to analyze our 16S rRNA targets. Many of the classes, orders, families, etc. in our results are identified only as c_, o_, f_…
I would like to know the difference between unclassified sequences and sequences that are labeled c_, o_, f_.

Thank you!

Nathalie

dwaite · July 20, 2014, 7:20am

In Greengenes, the one-letter prefix denotes the level in the taxonomy (p = phylum, c = class, etc.). Bayesian classification is based on differences in k-mer frequencies between taxonomic clusters in the reference database, which means that the finer the taxonomic level, the smaller the differences between organisms become. For example, there’s quite a large difference between the k-mer profiles of Firmicutes and Bacteroidetes, but a much smaller difference between Escherichia and Shigella.
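To make the k-mer idea concrete, here is a minimal Python sketch that compares the k-mer content of two sequences. It is not mothur's implementation: the function names and the toy sequences are made up for illustration, and the real classifier works with probabilities over the whole reference set rather than a simple shared fraction.

```python
from collections import Counter


def kmer_profile(seq, k=8):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_fraction(query, reference, k=8):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_kmers = set(kmer_profile(query, k))
    reference_kmers = set(kmer_profile(reference, k))
    if not query_kmers:
        return 0.0
    return len(query_kmers & reference_kmers) / len(query_kmers)


# Toy sequences (hypothetical, not real 16S fragments)
query = "ACGTACGTTGCA"
close_relative = "ACGTACGTTGCC"
distant_relative = "TTTTGGGGCCCC"

print(shared_kmer_fraction(query, close_relative, k=4))
print(shared_kmer_fraction(query, distant_relative, k=4))
```

The closer two references are to each other, the more k-mers they share with the same query, and the harder it becomes to choose between them with confidence.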

This means that classification becomes ‘harder’ the further down the taxonomic hierarchy you move. Depending on your query sequence, you may not be able to classify down to genus/species level, although you can probably get an accurate classification part of the way down the hierarchy (how you interpret that is up to you). A c_unclassified would mean you were able to accurately classify the sequence to a given phylum, but not to the class. An o_unclassified would mean you got an accurate assignment to class, but not order, and so on. You’ll be able to find the exact taxonomy in the *.taxonomy file.

Recently I saw that there was a request for mothur to stitch the unclassified tag to the previous level in the taxonomy so that you can more easily interpret these results, but as far as I know there’s currently no automatic way to do this.
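In the meantime, this kind of stitching can be done outside mothur. The following is a rough Python sketch, assuming the taxonomy is a semicolon-separated string in which unresolved levels carry the literal word "unclassified". The function name and the example string are illustrative only and are not part of mothur.

```python
def label_unclassified(taxonomy):
    """Replace each 'unclassified' level with '<last named taxon>_unclassified'."""
    levels = [level for level in taxonomy.strip().split(";") if level]
    last_named = None
    labelled = []
    for level in levels:
        if level.lower() == "unclassified":
            # Fall back to the plain tag if nothing above it was named
            labelled.append(f"{last_named}_unclassified" if last_named else level)
        else:
            last_named = level
            labelled.append(level)
    return ";".join(labelled)


# Illustrative taxonomy string (assumed format)
example = (
    "k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;"
    "f__Holophagaceae;unclassified;unclassified"
)
print(label_unclassified(example))
```

Each unresolved level then carries the name of the deepest taxon that was actually assigned, which makes summary tables easier to read.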

fortinn · July 21, 2014, 1:42pm

Hello,

Thank you for your response. We would like to clarify something.

Here is an example of what we have in our output:
p_Acidobacteria
c_Holophagae
o_Holophagales
f_Holophagaceae
g_
s_
g_Geothrix
s_
unclassified
unclassified

If we understand correctly, the unclassified sequences in this case would belong to the Holophagaceae family. Could they therefore have been identified as g_?

Nathalie

dwaite · July 21, 2014, 9:50pm

Hm, that might actually be something else. This only just occurred to me, but if I remember correctly, some entries in the Greengenes taxonomy don’t have taxonomic information down to species level; they end at order/family/genus and then just have empty o_, f_, g_ and s_ values for the remainder. That might be what you have. I would expect an unclassified Holophagaceae to look something like this (and Pat will correct me if I’m wrong here)

X.X.X.1 f_Holophagaceae 1
X.X.X.1.1 Unclassified 1

in your *.tax.summary file. If you have something like

X.X.X.1 f_Holophagaceae 3
X.X.X.1.1 g_ 2
X.X.X.1.2 Unclassified 1

Then you have 1 query sequence that couldn’t be accurately classified to genus level, and 2 that classified to a cluster of Holophagaceae with a blank (g_) genus entry.
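To check which of the two cases applies to a set of results, the genus level can be tallied directly. This is only a sketch: it assumes semicolon-separated strings in kingdom-to-species order, treats a bare prefix such as g_ or g__ as an empty reference label, and uses made-up assignments that mirror the three-sequence example above. The helper names are hypothetical.

```python
import re
from collections import Counter

# A bare rank prefix with no name after it, e.g. "g_", "g__", "s__"
EMPTY_LABEL = re.compile(r"^[a-z]_+$")


def describe_level(level):
    """Classify one taxonomy level as 'named', 'empty_label', or 'unclassified'."""
    if level.lower() == "unclassified":
        return "unclassified"
    if EMPTY_LABEL.match(level):
        return "empty_label"
    return "named"


def genus_status(taxonomy, genus_index=5):
    """Report the status of the genus level (index 5 in a k;p;c;o;f;g;s string)."""
    levels = [level for level in taxonomy.strip().split(";") if level]
    if len(levels) <= genus_index:
        return "missing"
    return describe_level(levels[genus_index])


family = "k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae"
# Hypothetical assignments: two blank-genus hits and one below-threshold hit
assignments = [
    family + ";g__;s__",
    family + ";g__;s__",
    family + ";unclassified;unclassified",
]
print(Counter(genus_status(t) for t in assignments))
```

An "empty_label" count reflects a gap in the reference taxonomy, while an "unclassified" count reflects a sequence that could not be placed with enough confidence.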

pschloss · July 24, 2014, 3:50pm

In the gg_13_5_99.gg.tax greengenes taxonomy file the lines with g__Geothrix look like k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__Geothrix;s__;

So you shouldn’t be getting k__Bacteria;p__Acidobacteria;c__;o__;f__Holophagaceae;g__Geothrix;

What is the exact syntax of the command you are running and can you post the sequence that is generating this type of taxonomy?

fortinn · July 30, 2014, 7:56pm

Hello,

Thank you for your response!

Our syntax is:

trim.seqs(fasta=my_fastq_file.fasta, oligos=my_file_that_has_MIDs.txt, qfile=my_fastq_file.qual, qwindowaverage=20, qwindowsize=50, minlength=number, keepforward=T, processors=x)

classify.seqs(fasta=my_fastq_file.trim.fasta, method=wang, group=my_fastq_file.groups, template=greengenes_version.fasta, taxonomy=greengenes_version.gg.tax, cutoff=50, processors=x)

In our results, we would obtain three separate sequences like:

k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g_;s__
k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__Geothrix;s__
k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;unclassified;unclassified

We are not sure we correctly understand the difference between the unclassified and empty taxa. Here is our interpretation. For example, a known phylum followed by Unclassified Unclassified means: the threshold for an accurate mothur Bayesian classification was met at the phylum level, but a class within this phylum cannot be found in Greengenes.

c_ means:
The sequence met the accuracy threshold of mothur’s Bayesian classification to be assigned to a phylum but not to a class.

Is our interpretation correct?

Nathalie

pschloss · July 31, 2014, 2:27pm

Three things…

1. qwindowaverage=20 won’t do anything. The reads come off the sequencer with an average quality score of 20. You really need to use qwindowaverage=35 if you are working with 454 data.
2. You aren’t working with our greengenes files. greengenes inserts the c_; because it doesn’t fit their taxonomy. If mothur includes “unclassified”, that means the consensus score was less than your threshold (e.g. 50 in your case).
3. 50 is a very low threshold. 80 is recommended.
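The effect of the threshold can be illustrated with a small Python sketch. It assumes each level is written as name(score) with an integer confidence score, and it marks every level from the first sub-threshold one downward as unclassified. The scores are invented for illustration and the function is not part of mothur.

```python
import re

# One taxonomy level followed by an integer score, e.g. "g__Geothrix(62)"
LEVEL_WITH_SCORE = re.compile(r"^(?P<name>.+)\((?P<score>\d+)\)$")


def apply_cutoff(scored_taxonomy, cutoff=80):
    """Keep levels until the first one scoring below cutoff; mark the rest unclassified."""
    result = []
    below = False
    for level in filter(None, scored_taxonomy.strip().split(";")):
        match = LEVEL_WITH_SCORE.match(level)
        if match is None:
            raise ValueError(f"level without a score: {level!r}")
        if below or int(match["score"]) < cutoff:
            below = True
            result.append("unclassified")
        else:
            result.append(match["name"])
    return ";".join(result)


# Hypothetical scores for a single query sequence
scored = (
    "k__Bacteria(100);p__Acidobacteria(100);c__Holophagae(98);"
    "o__Holophagales(95);f__Holophagaceae(90);g__Geothrix(62);s__(62)"
)
print(apply_cutoff(scored, cutoff=50))
print(apply_cutoff(scored, cutoff=80))
```

With these invented scores, a cutoff of 50 keeps the genus assignment, while a cutoff of 80 stops at the family and reports the lower levels as unclassified.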

I would strongly encourage you to use the mothur-formatted files available here:

http://www.mothur.org/wiki/Greengenes-formatted_databases

Pat

fortinn · August 6, 2014, 5:25pm

Hello,

Thank you for your response. We are currently working with PGM Ion Torrent data.

We have been using your Greengenes database for this analysis. We used the May release, gg_13_5_99. We will try running the analysis on the August version to see if there is a difference.

I will keep you posted!

Best regards,

Nathalie
