unclassified sequences?

Hi,

I used the SILVA database for the V4 region. After running the batch command, I got the following taxonomy file (first few lines):

OTU Size Taxonomy
Otu00001 441613 Bacteria(100);Actinobacteria(100);Actinobacteria(100);Bifidobacteriales(100);Bifidobacteriaceae(100);Bifidobacterium(100);
Otu00002 237387 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Blautia(100);
Otu00003 161365 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Faecalibacterium(100);
Otu00004 159256 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Roseburia(100);
Otu00005 122772 Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Streptococcaceae(100);Streptococcus(100);
Otu00006 118688 Bacteria(100);Firmicutes(100);Erysipelotrichia(100);Erysipelotrichales(100);Erysipelotrichaceae(100);Clostridium_XVIII(99);
Otu00007 114743 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);
Otu00008 100156 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Anaerostipes(100);
Otu00009 97400 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcus(100);
Otu00010 96347 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Blautia(98);
Otu00011 95365 Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Streptococcaceae(100);Streptococcus(100);
Otu00012 83085 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Ruminococcus2(62);
Otu00013 73078 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);
.
.

There are a total of 3533298 reads for my samples. My questions:

  1. What is an acceptable % of ‘unclassified’ sequences?
  2. How do I deal with them? Do I need to try greengenes too?
  3. Do I just delete the OTUs (does mothur have a script for this)?

thanks!

Hello,

Unclassified sequences aren’t really a bad thing. How many you should expect depends on the environment you sampled, so there is no “acceptable %” as such. If the environment lacks well-characterized organisms and is therefore poorly represented in the database, as sediments and soils often are, you would expect more unclassifieds than in, for example, a gut microbiome, whose members are comparatively well characterized and well represented in the reference databases.

You can try Greengenes or another database to classify your sequences (you should really be using the one that best suits your environment anyway, as they do differ). But the classification step in mothur is not the end of the road; it is really only a first step. For example, you could take the representative FASTA sequences for these OTUs, BLAST them, and see what you get. Are the BLAST hits for the unclassifieds what you would expect to see in your samples?

TL;DR: unclassifieds aren’t bad, but think about what you expect to see in your sampled environment.

Cheers
Richard

Hi Richard,

Thanks for the reply. The samples are gut/stool, so I shouldn’t expect too many unclassifieds, right? How do I extract the representative FASTA sequences? And do I BLAST them on the “Microbial Nucleotide BLAST” page at NCBI?

thanks!

Given that you’re working with a gut microbiome, you probably would not expect a huge number of unclassifieds, but I would still expect some. Again, there’s no golden rule here. To put it in perspective: in my marine sediment samples, 11,322 of a total of 14,500 OTUs are unclassified at some level (most are classified to at least family or genus level). When I BLAST my unclassifieds, I frequently get good hits to uncultured clones from very similar environments.
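If you want a number for your own data, a rough Python sketch along these lines could total the reads that sit in OTUs without a genus-level name. It assumes the whitespace-separated OTU / Size / Taxonomy layout shown in the question. The function names are made up for illustration, and treating a missing sixth rank, or a rank name containing "unclassified", as "unclassified at genus" is my own assumption rather than a mothur definition.

```python
import re


def parse_taxonomy_line(line):
    """Split 'Otu00001 441613 Bacteria(100);...' into (otu, size, rank names)."""
    otu, size, taxonomy = line.split(None, 2)
    # Drop the bootstrap support in parentheses, e.g. 'Blautia(98)' -> 'Blautia'.
    ranks = [re.sub(r"\(\d+\)$", "", r) for r in taxonomy.strip().split(";") if r]
    return otu, int(size), ranks


def is_unclassified(ranks, depth=6):
    """Assumption: depth 6 is genus (domain; phylum; class; order; family; genus)."""
    return len(ranks) < depth or "unclassified" in ranks[depth - 1].lower()


def unclassified_share(lines, depth=6):
    """Fraction of reads belonging to OTUs that are unclassified at `depth`."""
    total = unclassified = 0
    for line in lines:
        # Skip blank lines and the 'OTU Size Taxonomy' header row.
        if not line.strip() or line.startswith("OTU"):
            continue
        _, size, ranks = parse_taxonomy_line(line)
        total += size
        if is_unclassified(ranks, depth):
            unclassified += size
    return unclassified / total if total else 0.0


# Three rows copied from the taxonomy file in the question.
example = [
    "OTU Size Taxonomy",
    "Otu00006 118688 Bacteria(100);Firmicutes(100);Erysipelotrichia(100);Erysipelotrichales(100);Erysipelotrichaceae(100);Clostridium_XVIII(99);",
    "Otu00007 114743 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);",
    "Otu00013 73078 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);",
]
share = unclassified_share(example)
print(f"{share:.1%} of reads are in OTUs without a genus-level name")
```

Run on the whole file, this gives a read-weighted percentage, which is a different figure from counting OTUs as I did for my sediment samples.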

To get the representative sequence for each of your OTUs, use the get.oturep command in mothur. You can also combine these representative sequences with your shared file and taxonomy data into a single database file using create.database, which I find very useful for downstream analysis.
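Once you have the representative FASTA file, you may only want to BLAST the OTUs that came back unclassified. Here is a minimal Python sketch for pulling those records out. It assumes each FASTA header carries the OTU label as a token separated by whitespace or "|"; check the headers in your own file, since the exact layout can vary. The function name and the sequence names in the example are hypothetical.

```python
import re


def select_fasta_records(fasta_text, wanted_otus):
    """Return FASTA text with only the records whose header names a wanted OTU."""
    wanted = set(wanted_otus)
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Break the header into tokens on whitespace and '|', then look
            # for an exact match against the wanted OTU labels.
            tokens = re.split(r"[\s|]+", line[1:])
            keep = any(tok in wanted for tok in tokens)
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical representative sequences; only the OTU labels come from the thread.
rep_fasta = (
    ">seqA\tOtu00007|114743\nACGT\nGGCC\n"
    ">seqB\tOtu00001|441613\nTTTT\n"
    ">seqC\tOtu00013|73078\nAAAA\n"
)
subset = select_fasta_records(rep_fasta, {"Otu00007", "Otu00013"})
print(subset, end="")
```

The resulting text can be saved to a file and uploaded to BLAST in one go.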

Yes, to BLAST the sequences you can use the NCBI BLAST website.

Cheers
Richard