The EMP, QIIME and Mothur

So, I’m dealing with data from “The Earth Microbiome Project”… Illumina HiSeq @ 100bp… I decided to run the data through Mothur rather than use the QIIME-analysed data they provide. They say:

_97% OTUs are picked in the initial EMP analyses using a closed-reference OTU picking protocol against the Greengenes database pre-clustered at 97% identity.
This is done using

This process works as follows. Reads are pre-sorted by abundance in QIIME so the most frequently occurring sequences will be chosen as OTU centroid sequences. Each read is then searched against the Greengenes reference sequences using reference-based uclust version 1.2.22. The call to uclust issued by QIIME looks like:

uclust --id 0.97 --w 12 --stepwords 20 --usersort --maxaccepts 20 --libonly --stable_sort --maxrejects 500

Reads which hit a sequence in the reference collection at greater than or equal to 97% identity are assigned to an OTU defined by the reference sequence they match. Reads which fail to hit a reference sequence at at least 97% identity are discarded_
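The closed-reference procedure quoted above can be sketched in a few lines of Python. This is only an illustration of the logic (sort by abundance, compare each read to every reference, keep hits at or above 97%, discard the rest), not what uclust actually does internally: uclust uses a word-based heuristic search with gapped alignment, whereas the hypothetical `identity` helper here simply counts matching positions between equal-length, pre-aligned strings. The function names and the input layout are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_pick(
    reads: List[str], references: Dict[str, str], threshold: float = 0.97
) -> Tuple[Dict[str, int], int]:
    """Assign reads to the best-matching reference; discard reads below the threshold.

    reads: read sequences (duplicates allowed)
    references: reference ID -> reference sequence
    Returns (otu_counts, discarded), where otu_counts maps reference ID -> read count.
    """
    otu_counts: Counter = Counter()
    discarded = 0
    # Walk the unique reads from most to least abundant (the abundance pre-sort).
    for seq, count in Counter(reads).most_common():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otu_counts[best_id] += count  # read joins the OTU defined by that reference
        else:
            discarded += count  # no reference hit at >= threshold: read is dropped
    return dict(otu_counts), discarded
```

The key point for this thread is the `else` branch: anything that does not match a reference is silently dropped, so the surviving reads all carry a reference taxonomy by construction.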

With that I end up with about 200 000 sequences per sample, with presumably good taxonomy (unclassified reads are removed by design).

When I run the EMP quality-filtered sequences through Mothur, I end up with 55 000 sequences (the mothur run applies stricter quality control; the initial quality-filtered sequences had base calls as low as Q18) and hopeless taxonomy: there is a tonne of unclassified sequences.

So how can I end up with poorer taxonomy from higher-quality sequences? Any ideas? How do the algorithms in QIIME work? Their website is confusing…

So I know virtually nothing about QIIME, I’d suggest asking them QIIME-related questions…

I’m not 100% sure how you’re using mothur to process the sequences, but if you are trimming them to something even shorter than they were initially, that would affect classification. Shorter sequences classify more poorly than longer ones. Also, if you’re using a different confidence score threshold, that could affect things as well.
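To make the confidence-threshold point concrete, here is a minimal Python sketch of how a cutoff turns low-confidence ranks into "unclassified". It assumes a taxonomy string laid out as `Name(score);Name(score);…`, and the function name `apply_cutoff` and the example scores are made up for illustration; it is not a description of how any particular classifier computes or reports its scores.

```python
import re
from typing import List

# One taxonomy field, e.g. "Firmicutes(72)": a name followed by an integer score.
RANK_PATTERN = re.compile(r"^(?P<name>.+)\((?P<conf>\d+)\)$")


def apply_cutoff(taxonomy: str, cutoff: int) -> List[str]:
    """Return the rank names, replacing everything from the first rank below cutoff."""
    names = []
    below = False
    for field in taxonomy.strip().strip(";").split(";"):
        match = RANK_PATTERN.match(field)
        if match is None:
            raise ValueError(f"unexpected taxonomy field: {field!r}")
        # Once one rank fails the cutoff, all deeper ranks are unclassified too.
        if below or int(match["conf"]) < cutoff:
            below = True
            names.append("unclassified")
        else:
            names.append(match["name"])
    return names
```

With a hypothetical read scored as `Bacteria(100);Firmicutes(72);Clostridia(41);`, a cutoff of 60 keeps the phylum, while a cutoff of 80 leaves only the domain. Shorter reads tend to get lower scores, so the same cutoff produces more "unclassified" ranks.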

Hope this helps…

hey Pat,
They used 75-100bp, while I’ve ended up with 96bp after MSA, so I’m pretty sure that sequence length shouldn’t be an issue. It is such a load of crap! I think they used uclust to cluster sequences to a reference taxonomy or something - it is so vague. From my understanding, uclust performs badly anyway (according to papers comparing clustering methods). There is no mention of their confidence scores either. When I use a cutoff of 60, I get unknowns and unclassifieds at phylum level. I think I’ll shoot them an email.

Thanks for the help and keep up the good work - I still have my confidence in mothur!


After a bit of searching;

_In a closed-reference OTU picking process, reads are clustered against a reference sequence collection and any reads which do not hit a sequence in the reference sequence collection are excluded from downstream analyses. is the primary interface for closed-reference OTU picking in QIIME. If the user provides taxonomic assignments for sequences in the reference database, those are assigned to OTUs.

You must use closed-reference OTU picking if:

You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA. Your reference sequences must span both of the regions being sequenced.

You cannot use closed-reference OTU picking if:

You do not have a reference sequence collection to cluster against, for example because you’re working with an infrequently used marker gene.


Speed. Closed-reference OTU picking is fully parallelizable, so is useful for extremely large data sets.
Better trees and taxonomy. Because all OTUs are already defined in your reference sequence collection, you may already have a tree and a taxonomy that you trust for those OTUs. You have the option of using those, or building a tree and taxonomy from your sequence data.


Inability to detect novel diversity with respect to your reference sequence collection. Because reads that don’t hit the reference sequence collection are discarded, your analyses only focus on the diversity that you “already know about”. Also, depending on how well-characterized the environment that you’re working in is, you may end up throwing away a small fraction of your reads (e.g., discarding 1-10% of the reads is common for 16S-based human microbiome studies, where databases like Greengenes cover most of the organisms that are typically present) or a large fraction of your reads (e.g, discarding 50-80% of the reads has been observed for “unusual” environments like the Guerrero Negro microbial mats)._

Again, there is no clear definition of how they cluster the raw sequences to the reference taxonomy, but I’m assuming they use uclust with the reference sequences as centroids…

Pat, what are your thoughts on this approach to classifying sequences? I guess they use pairwise alignments of the raw sequences against the reference, and if the identity is >=97%, the read is given that reference’s taxonomy (that’s my understanding?). This is different to k-mer searching such as the RDP/Wang classifier…
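For contrast with the identity-threshold approach, here is a deliberately simplified Python sketch of a k-mer based match. It just picks the reference that shares the largest fraction of the query's k-mers; the real RDP/Wang classifier is a naive Bayes model over 8-mers with bootstrapped confidence scores, which this does not implement. The function names and `k=8` default are assumptions for the example.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int = 8) -> Set[str]:
    """All distinct overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_best_match(query: str, references: Dict[str, str], k: int = 8) -> Tuple[str, float]:
    """Return (reference ID, fraction of query k-mers found in that reference)."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")
    if not references:
        raise ValueError("no reference sequences given")
    scores = {
        ref_id: len(query_kmers & kmers(ref_seq, k)) / len(query_kmers)
        for ref_id, ref_seq in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Note the difference in behaviour: no alignment is needed, and there is always a best match, however poor. Nothing is discarded unless a separate confidence cutoff is applied afterwards, whereas the closed-reference approach throws away every read under the identity threshold.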


Does the search=distance in the classify.seqs give you an analogous approach here?

My opinion? Well, I think I’ve been pretty public in our papers that database-independent approaches are the way to go for microbes. We then go back and classify OTUs using something like classify.otu. I think the requirements for their closed database should be even more stringent than they allow. For instance, they have a database of sequences that are 3% different. But for what region? We’ve shown in a PLoS Comp Biology paper how there are very different rates of evolution across the gene. So really, you’d need a region-specific database. Furthermore, in their “You must use…” comment, in light of this variation across the gene, comparing V2 to V4 is a fool’s errand. You just can’t do it at a meaningful level (e.g. sub-order?).

Our observation has been that if we get good data, these questionable heuristics are not needed. If you look at their papers, they’re trying to excuse away bad data or look at such a broad level that it isn’t really informative. As we showed in our Gut Microbes paper from last year, we do see that different OTUs within a genus behave differently. These are some of the reasons why we stick with OTU-based approaches and try to avoid the biases inherent in the database-centered approaches.


Pat, you’re the best. Thanks for the info.

Dear Shaun & Patrick,

sorry for bringing this topic up again after nine months. I came across it while googling, and I think that there’s one important point to add.

I fully agree with both your explanations, although I think that with a very good (!) reference OTU set, open-reference or closed-reference picking can be quite useful. However, I noted that Greengenes currently provides 97% uclust pre-clustering. This means: you get a set of heuristic reference OTUs, and a heuristic is used to map your reads against it.
Although I am sure that you (Shaun) have moved on in the meantime, I have a question out of pure curiosity: did you compare results (i.e., your biological findings) between the different pipelines? I mean, did you observe the same things for the reference-based pipeline and the mothur-de novo clustering pipeline?



Thanks Sebastian,

The other thing to add is that 97% similarity over what window? I would presume the full-length gene. We’ve shown elsewhere (PLoS Comp Biol) that none of the subregions evolve at the same rate. So some species/genera may have identical 16S within your region even though they are only 97% similar over the full length. The converse would also be true.
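A toy Python example of the window problem: two made-up "genes" that are 97% identical over their full length, with all of the differences packed into one stretch. The sequence length, the positions of the substitutions, and the helper names are arbitrary choices for illustration, not real 16S data.

```python
def identity(a: str, b: str, start: int = 0, end: int = None) -> float:
    """Fraction of matching positions in a[start:end] vs b[start:end] (pre-aligned)."""
    if len(a) != len(b):
        raise ValueError("sequences must be of equal length")
    window_a, window_b = a[start:end], b[start:end]
    if not window_a:
        raise ValueError("empty window")
    return sum(x == y for x, y in zip(window_a, window_b)) / len(window_a)


# Each base is swapped for a different one, so every listed position is a mismatch.
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(seq: str, positions) -> str:
    """Return a copy of seq with a substitution at each given position."""
    bases = list(seq)
    for p in positions:
        bases[p] = SWAP[bases[p]]
    return "".join(bases)


gene_a = "ACGT" * 250                      # 1000 positions
gene_b = mutate(gene_a, range(0, 90, 3))   # 30 substitutions, all in the first 90 positions

full_length = identity(gene_a, gene_b)             # 970 of 1000 positions match
variable_region = identity(gene_a, gene_b, 0, 100)   # 70 of 100 positions match
conserved_region = identity(gene_a, gene_b, 500, 600)  # all 100 positions match
```

Here the pair sits exactly at 97% over the full length, looks identical in the 500-600 window, and is only 70% identical in the 0-100 window. Which OTU a short read lands in therefore depends on which region was sequenced, not just on the full-length threshold used to build the reference set.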

Ultimately, I think this is a trick for dealing with crappy 100PE sequence data and it’s a shame that EMP sacrificed data quality for the amount of data.


Hey guys!

I’m just looking at this again! We got some new data back from the EMP, but are dealing with raw sequences from a HiSeq this time, hence me checking back on the forums.

Sab, I never did compare results, but now that you have asked, I will have a look. This will have to be done anyway, because another group working with us has already written a manuscript with the EMP analysis of the first sequences, while this time I am analysing the sequences (a second round) with mothur - obviously they want some consistency across their experiments.