I have a very high abundance of ‘Bacteria_unclassified’ in my taxonomy file: Bacteria(100);Bacteria_unclassified(100);Bacteria_unclassified(100);Bacteria_unclassified(100);Bacteria_unclassified(100);Bacteria_unclassified(100);Bacteria_unclassified_unclassified(100);
I classified with the silva.nr_123 database.
I was wondering if this is normal, or if someone has come across this before, or if it is my data that is crap?
I guess we can remove them with remove.lineage(fasta=X, count=X, taxonomy=X, taxon=Bacteria_unclassified)?
What percentage? How long are your sequences? What environment?
I noticed this too. I didn’t get a lot per se, but when I made a heatmap with my OTU data and limited the output to the 75 most abundant, Bacteria_unclassified was listed several times, which got my attention. In my stabilityps1.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy file, there are 103864 total entries. Of those, 553 entries have Bacteria(100);Bacteria_unclassified(100);Bacteria_unclassified(100);Bacteria_unclassified(100);Bacteria_unclassified(100); associated. I found which of these OTUs had the most sequences in them, grabbed the representative sequence for each, and BLASTed it. The two top hits were uncultured, unidentified environmental samples in both cases. The results made sense based on my samples.
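If anyone wants to get the same kind of count without doing it by hand, here is a minimal Python sketch. It assumes the cons.taxonomy file is tab-separated with an OTU / Size / Taxonomy header, and it treats an OTU as unclassified when the second rank is Bacteria_unclassified. The function name and the demo rows are made up for illustration; they are not my real data.

```python
import io


def summarize_unclassified(handle, label="Bacteria_unclassified"):
    """Count OTUs and sequences whose second rank equals `label`.

    Assumes tab-separated columns: OTU, Size, Taxonomy (with a header row).
    """
    total_otus = total_seqs = hit_otus = hit_seqs = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "OTU":
            continue  # skip the header and malformed lines
        size = int(fields[1])
        # Drop the trailing "(confidence)" from each rank name.
        ranks = [r.rsplit("(", 1)[0] for r in fields[2].split(";") if r]
        total_otus += 1
        total_seqs += size
        if len(ranks) > 1 and ranks[1] == label:
            hit_otus += 1
            hit_seqs += size
    return {
        "total_otus": total_otus,
        "unclassified_otus": hit_otus,
        "total_seqs": total_seqs,
        "unclassified_seqs": hit_seqs,
        "seq_fraction": hit_seqs / total_seqs if total_seqs else 0.0,
    }


# Invented demo rows, only to show the expected layout.
DEMO = (
    "OTU\tSize\tTaxonomy\n"
    "Otu0001\t120\tBacteria(100);Proteobacteria(100);Gammaproteobacteria(98);\n"
    "Otu0002\t30\tBacteria(100);Bacteria_unclassified(100);Bacteria_unclassified(100);\n"
    "Otu0003\t50\tBacteria(100);Bacteroidetes(95);Bacteroidia(90);\n"
)

if __name__ == "__main__":
    print(summarize_unclassified(io.StringIO(DEMO)))
```

For a real file, pass `open("your.cons.taxonomy")` instead of the `io.StringIO` demo. Looking at the sequence fraction as well as the OTU count is useful, because a handful of large unclassified OTUs matters more than many singletons.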
My sequences are of the V3 and V4 region, unaligned they are 464nt, aligned to the Silva_v128 database (I used this for classification as well), and subsequently filtered, they are 846nt. They were run on a MiSeq using the 16S amplicon protocol.
I see this quite a bit in my samples, which are mostly marine sediments from locations that have only recently been sampled for the first time.
As the classifications are only as good as the classification db, it’s entirely possible that these Bacteria_unclassified are just that: bacteria for which there is no closely related classified hit.
The important question is where are your samples from, and is the presence of these unclassified Bacteria reasonable? If it were from the human gut I would think not, but if it’s from a high diversity environmental sample, then perhaps.
Hi, thanks!
Yes, they represent about 2 to 30% of my samples, which are marine biofilms from different environments (16S V3-V4, MiSeq 2x250).
I guess it could be, then, that they are just not classified yet!
One more thing is that classifying the OTUs in mothur is not the end of the process. I would BLAST the representative sequences from these Bacteria_unclassified OTUs. You may well find you get good hits to uncultured clones from environments similar to the one you sampled, which would support the idea that they are valid community members, just not well classified.
I don’t know if this is still a question of interest, but while working with my environmental samples using Illumina MiSeq sequencing (single-end mode) and the QIIME 2 and mothur pipelines, I ran into many issues and questions. One was the fraction of unclassified bacteria, which was first <30% per sample. I tried different alignment approaches (for instance direct SINA alignment with SILVA as the reference database) and could reduce this value to 15-23% per sample. However, after some research and personal communications, I concluded that the issue lies in the bootstrap cutoff of the classifier; I had been working with a cutoff of 80%. Somewhere in the SILVA manual it says a bootstrap of 40% should also be acceptable. With this value, I reduced the unclassified bacteria to <5% per sample. This could be helpful for statistical analysis with RPCA or similar, because you have better taxonomic assignments. But be careful: playing with lower bootstrap values also increases the probability of wrong assignments (‘likelihood of the tree’)!
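To make the effect of the cutoff concrete, here is a rough Python sketch of how a bootstrap cutoff turns a lineage into "_unclassified" labels. It is only meant to mimic the pattern visible in the taxonomy strings in this thread (ranks below the cutoff are replaced by the last confident rank plus "_unclassified"); it is not the classifier's actual code, and the function names and the example lineage are invented.

```python
import re

# One rank looks like "Name(confidence)", e.g. "Bacteria(100)".
RANK = re.compile(r"^(.*)\((\d+)\)$")


def parse_lineage(lineage):
    """Split 'A(100);B(72);' into [('A', 100), ('B', 72)]."""
    pairs = []
    for item in lineage.strip().strip(";").split(";"):
        match = RANK.match(item)
        if match is None:
            raise ValueError(f"no confidence value in {item!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


def apply_cutoff(lineage, cutoff):
    """Keep ranks until the first one below `cutoff`, then pad the rest
    with '<last kept rank>_unclassified' (assumed behaviour, see lead-in)."""
    pairs = parse_lineage(lineage)
    kept = []
    for name, conf in pairs:
        if conf < cutoff:
            break
        kept.append((name, conf))
    if not kept:
        # Assumption for this sketch only: nothing passed the cutoff.
        return "unknown;"
    last_name, last_conf = kept[-1]
    padding = [(last_name + "_unclassified", last_conf)] * (len(pairs) - len(kept))
    return ";".join(f"{n}({c})" for n, c in kept + padding) + ";"


# Invented lineage with weak support below the domain level.
EXAMPLE = "Bacteria(100);Proteobacteria(72);Gammaproteobacteria(55);Alteromonadales(41);"

if __name__ == "__main__":
    for cutoff in (80, 40):
        print(cutoff, apply_cutoff(EXAMPLE, cutoff))
```

With a cutoff of 80 the example collapses to Bacteria_unclassified at every rank below the domain, while a cutoff of 40 keeps the full lineage. That is exactly why the unclassified fraction drops at lower cutoffs, and also why the lower ranks you gain are the ones with weaker support.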
For the classification, I recommend using a positive control of some sort to see how the different databases react to your sequences. For example, based on Zymo DNA (I do classic gut bacteria analysis), when I use PDS I must set the cutoff at 70% to get a correct classification of sequences, because if I run at 75% I lose, for example, the Salmonella taxonomic assignment in my positive control. But for Silva, a 75% cutoff is ok, and I am running a classification as we speak at 80% to see if my positive control still makes sense. I do 1000 iters (I am running on Compute Canada servers).
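A small Python sketch of the positive control check, in case it is useful: given the genera you expect in the control and the lineages the classifier returned for it, list the expected genera that never show up. The function name, the expected list, and the lineages are all hypothetical placeholders (only Salmonella comes from my own case); put in the composition from your control's documentation.

```python
def missing_from_control(expected_genera, observed_lineages):
    """Return expected genera that appear in none of the observed lineages."""
    observed = set()
    for lineage in observed_lineages:
        for item in lineage.strip().strip(";").split(";"):
            # Drop the trailing "(confidence)" from each rank name.
            observed.add(item.rsplit("(", 1)[0])
    return sorted(g for g in expected_genera if g not in observed)


# Hypothetical control composition and classifier output, for illustration.
EXPECTED = ["Salmonella", "Bacillus"]
AT_CUTOFF_70 = [
    "Bacteria(100);Proteobacteria(100);Enterobacteriaceae(99);Salmonella(71);",
    "Bacteria(100);Firmicutes(100);Bacillaceae(98);Bacillus(95);",
]
AT_CUTOFF_75 = [
    "Bacteria(100);Proteobacteria(100);Enterobacteriaceae(99);Enterobacteriaceae_unclassified(99);",
    "Bacteria(100);Firmicutes(100);Bacillaceae(98);Bacillus(95);",
]

if __name__ == "__main__":
    print("cutoff 70 missing:", missing_from_control(EXPECTED, AT_CUTOFF_70))
    print("cutoff 75 missing:", missing_from_control(EXPECTED, AT_CUTOFF_75))
```

Running the same check for each database and cutoff gives you a quick table of where the control starts to fall apart.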
Controls are the key, both negative and positive.
Best of successes
Hello! I just finished my classification with PDS18 and Silva138. With a cutoff of 70, PDS18 is still not able to correctly classify my positive control (I swear it did with PDS16) while Silva138 is fine using a cutoff of 80.
Just to confirm that there is a problem with PDS18 for classification: even with a cutoff of 65%, it still does not correctly assign my positive control. If you want to use PDS, use trainset16.
I’d strongly discourage using a threshold under 80%. By dipping down to lower levels you are admitting that you have less confidence in the data, which never seemed like a good idea to me. As Alexandre mentioned, you can try other databases and see if the classification improves.