We have been using the Greengenes database to analyze our 16S rRNA targets. We are getting a lot of classes, orders, families etc. that are identified as c_, o_, f_…
I would like to know what the difference is between unclassified sequences and the sequences that are labeled c_, o_, f_?
In Greengenes the one-letter prefix denotes the level in the taxonomy (p = phylum, c = class, etc.). Bayesian classification is based on the differences in the frequency of kmers between taxonomic clusters in the reference database, which means that the finer the taxonomic level, the smaller the differences between organisms become. For example, there’s quite a large difference in the kmer profiles of Firmicutes vs Bacteroidetes, but a much smaller difference between Escherichia and Shigella.
This means that classification becomes ‘harder’ the further down the taxonomic hierarchy you move, and depending on your query sequence you may not be able to classify down to genus/species level, although you can probably get an accurate classification part of the way down the hierarchy (how you interpret that is up to you). A c_unclassified would mean you were able to accurately classify the sequence to a given phylum, but not to the class. An o_unclassified would mean you got an accurate assignment to class, but not order, and so on. You’ll be able to find the exact taxonomy in the *.taxonomy file.
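To make the kmer point concrete, here is a toy sketch in Python. It is not the Wang classifier mothur uses; it only computes the Jaccard overlap between the 8-mer sets of two sequences, and the sequences themselves are invented for illustration. The idea is that a close relative shares most of its kmers with the reference, while a distant sequence shares few or none.

```python
def kmer_set(seq, k=8):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_overlap(a, b, k=8):
    """Jaccard overlap of two sequences' k-mer sets (1.0 = identical sets)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


# Made-up toy sequences, not real 16S fragments.
reference = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCC"
close_relative = reference[:20] + "T" + reference[21:]  # one substitution
distant = "TTGACCGGTATACCGGAATTCCGGTTAACCGGTA"

print(kmer_overlap(reference, close_relative))
print(kmer_overlap(reference, distant))
```

The smaller the gap between two such overlap values, the harder it is to tell the candidate taxa apart, which is the situation at genus/species level.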
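As a rough illustration of how a confidence cutoff produces an "unclassified" label part of the way down the hierarchy, here is a small Python sketch. It assumes a lineage string of the form Name(score);Name(score);... with one score per rank; the scores in the example are invented, and the function and rank names are hypothetical helpers, not part of mothur.

```python
import re

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def apply_cutoff(lineage, cutoff=80):
    """Keep ranks until the first score below cutoff.

    Returns (kept_names, first_unclassified_rank), where the second item is
    None if every rank in the lineage passed the cutoff.
    """
    kept = []
    for field in lineage.strip().strip(";").split(";"):
        match = re.fullmatch(r"(.+)\((\d+)\)", field)
        if match is None or int(match.group(2)) < cutoff:
            break  # this rank and everything below it is unclassified
        kept.append(match.group(1))
    rank = RANKS[len(kept)] if len(kept) < len(RANKS) else None
    return kept, rank


# Invented scores, for illustration only.
example = "k__Bacteria(100);p__Acidobacteria(97);c__Holophagae(62);o__Holophagales(40);"
print(apply_cutoff(example, cutoff=50))
print(apply_cutoff(example, cutoff=80))
```

With the same lineage, a stricter cutoff simply moves the "unclassified" point higher up the hierarchy.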
Recently I saw that there was a request for mothur to stitch the unclassified tag to the previous level in the taxonomy so that you can more easily interpret these results, but as far as I know there’s currently no automatic way to do this.
Thank you for your response. We would like to clarify something.
Here is an example of what we have in our output:
If we understand properly, the unclassified sequences in this case would belong to the Holophagaceae family. Could they therefore have been identified as g_?
Hm, that might actually be something else. This only just occurred to me, but if I remember correctly there are some entries in the Greengenes taxonomy that don’t have taxonomic information down to species level; they end at order/family/genus and then just have empty o_, f_, g_ and s_ values for the remainder. That might be what you have. I would expect an unclassified Holophagaceae to look something like (and Pat will correct me if I’m wrong here)
X.X.X.1 f_Holophagaceae 1
X.X.X.1.1 Unclassified 1
in your *.tax.summary file. If you have something like
X.X.X.1 Holophagaceae 3
X.X.X.1.1 g_ 2
X.X.X.1.2 Unclassified 1
Then you have 1 query sequence that couldn’t be accurately classified to genus level, and 2 that classified to a cluster of Holophagaceae with a blank (g_) genus entry.
In the gg_13_5_99.gg.tax greengenes taxonomy file the lines with g__Geothrix look like k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__Geothrix;s__;
So you shouldn’t be getting k__Bacteria;p__Acidobacteria;c__;o__;f__Holophagaceae;g__Geothrix;
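One way to check a reference taxonomy for blank ranks is to split each lineage on semicolons and look for prefixes with no name after the double underscore. The Python sketch below does that for the two lineage strings quoted above; the function name is a hypothetical helper, not part of mothur or Greengenes.

```python
def empty_ranks(lineage):
    """Return the rank prefixes that have no name attached, e.g. ['s'] for 's__'."""
    empty = []
    for field in lineage.strip().strip(";").split(";"):
        prefix, _, name = field.strip().partition("__")
        if not name:
            empty.append(prefix)
    return empty


expected = "k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__Geothrix;s__;"
unexpected = "k__Bacteria;p__Acidobacteria;c__;o__;f__Holophagaceae;g__Geothrix;"

print(empty_ranks(expected))
print(empty_ranks(unexpected))
```

A blank rank found this way comes from the reference file itself, which is different from an "unclassified" label that reflects a score below the cutoff.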
What is the exact syntax of the command you are running and can you post the sequence that is generating this type of taxonomy?
Thank you for your response!
Our syntax is:
trim.seqs(fasta=my_fastq_file.fasta, oligos=my_file_that_has_MIDs.txt, qfile=my_fastq_file.qual, qwindowaverage=20, qwindowsize=50, minlength=number, keepforward=T, processors=x)
classify.seqs(fasta=my_fastq_file.trim.fasta, method=wang, group=my_fastq_file.groups, template=greengenes_version.fasta, taxonomy=greengenes_version.gg.tax, cutoff=50, processors=x)
In our results, we would obtain three separate sequences like:
We are not sure we understand correctly the difference between the unclassified and empty taxa. Here is our interpretation:
E.g. known phylum followed by Unclassified
The threshold for accurate mothur Bayesian classification was met to match a phylum, but a class within this phylum cannot be found in Greengenes.
The sequence met the accuracy threshold of mothur’s Bayesian classification to be assigned to a phylum but not to a class.
Is our interpretation correct?
qwindowaverage=20 won’t do anything. The reads come off the sequencer with an average quality score of 20. You really need to use qwindowaverage=35 if you are working with 454 data.
You aren’t working with our Greengenes files. Greengenes inserts the c_; because the sequence doesn’t fit a named class in their taxonomy. If mothur includes “unclassified”, that means the consensus score was less than your threshold (e.g. 50 in your case).
50 is a very low threshold. 80 is recommended.
I would strongly encourage you to use the mothur-formatted files available here:
Thank you for your response. We are currently working with PGM Ion Torrent data.
We have been using your Greengenes database for this analysis. We used the May release, gg_13_5_99. We will try running the analysis on the August version to see if there is a difference.
I will keep you posted!