UPARSE pipeline

In case anyone missed it, Robert Edgar has pooled his various usearch programs together into one solid pipeline. I’m in the middle of running some test data through it to see how it compares - the mothur 454 SOP vs. the mothur 454 SOP with singletons removed vs. a UParse/mothur hybrid (clustering in UParse, then alignment/classification/distance metrics in mothur).

It’s quite a nice system. You certainly end up with fewer OTUs from the initial steps, although the quality of some of these is dubious (a few of mine couldn’t be aligned, and then a few more ended up unclassified). It also relies on removal of singletons, so it may be cheating to say it has fewer OTUs :mrgreen:

In general, what are people’s thoughts on this sort of approach: clustering your reads prior to alignment, rather than aligning to a database and filtering from there?

Have not tried the UPARSE pipeline yet, but the paper suggests that the analyses can be done without removing singletons as well. Is this not the case in the pipeline?

Good point, there’s simply a step where the default settings remove singleton OTUs, but you can easily override it. In general, it doesn’t seem to matter though. Compared to mothur with singletons removed, uparse has slightly stronger results (less stress, higher R-squared, more variation explained in PCoA axes), but the difference is minimal. It almost never affects the statistical significance of a finding.

Mostly it’s just a bit faster.

I wasn’t a reviewer, but based on my initial read of the abstract and 5 sec skimming of the paper, you’ve hit on a number of things I would wonder about…

“It also relies on removal of singletons, so it may be cheating to say it has fewer OTUs”

Exactly.

  1. I also didn’t observe any objective metric for assessing OTU quality. Sorry, but I will hold any new approach to the standards we laid out in http://aem.asm.org/content/77/10/3219.abstract.

  2. My biggest problem with the alignment-independent approaches used by USEARCH etc. is that there are many advantages to aligning and proceeding from there, as outlined at http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej2012102a.html

Again, I need to take a closer look at this and I'll get back to y'all.

Hi, apologies as this isn’t entirely a mothur-orientated post. I am totally new to this area and have just started using UPARSE, and Google pointed me to this discussion. I just wondered, what do you do after you’ve got your OTU table from UPARSE? Feed it back into something like QIIME for taxonomy annotation? Convert it using something like http://biom-format.org/documentation/biom_conversion.html? Many thanks for any tips for a newbie.

Yea, pretty much. When I used it I made a quick-and-dirty python script to take the readmap file from usearch and use it to merge entries from my existing count table according to how usearch grouped the individual reads. So something like:

mothur > extract flows, denoise, trim, build a count table.

usearch > pass in denoised fasta file and process that into OTUs

python > process the old count table to match the new fasta file

mothur > alignment/classification. You have to be careful from here on though, because your OTU labels have different meanings now (eg, a 'unique' label doesn't represent the unique reads in your sample, but what usearch thinks the OTUs are).
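
A rough sketch of what the Python step could look like (this is not the original script, just an illustration). It assumes the usearch read map has already been parsed into a dict of read name to OTU label, and the count table into a dict of read name to per-sample counts. The function name `merge_counts` and both input structures are hypothetical.

```python
def merge_counts(count_rows, read_to_otu):
    """Sum per-sample counts of all reads assigned to the same OTU.

    count_rows:  dict of read name -> list of per-sample counts
                 (assumed to be parsed from the existing count table)
    read_to_otu: dict of read name -> OTU label
                 (assumed to be parsed from the usearch read map)
    """
    merged = {}
    for read, counts in count_rows.items():
        otu = read_to_otu.get(read)
        if otu is None:
            # Read is absent from the map, i.e. usearch did not assign it
            # to any OTU, so it is dropped from the new table.
            continue
        if otu not in merged:
            merged[otu] = list(counts)
        else:
            merged[otu] = [a + b for a, b in zip(merged[otu], counts)]
    return merged


# Toy data: two samples, three unique reads, readC not assigned to an OTU.
count_rows = {"readA": [5, 0], "readB": [1, 2], "readC": [0, 1]}
read_to_otu = {"readA": "OTU_1", "readB": "OTU_1"}
merged = merge_counts(count_rows, read_to_otu)
print(merged)  # {'OTU_1': [6, 2]}
```

Reads that are missing from the map are silently dropped here, so it is worth comparing the totals before and after merging to see how many reads usearch discarded.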

To be honest, I haven’t really gone back to uparse. I found the results weren’t worth the effort of hopping between pipelines. You pretty much get the same results just using the mothur SOP with the remove.rare() command thrown in, and then, like Pat said above, you have the confidence of the experimental validation of mothur’s approach.

Hello all,

This uparse question isn’t directly related to the original post, but I got directed to this thread after a Google search. I have been using mothur for my analysis, but I am interested in comparing other pipelines. I am starting with uparse, but I am unsure how it resolves mismatches in the overlapped region when merging paired reads. I am assuming the base with the higher quality score gets chosen, but does anyone know whether the new quality score is an average, or just the score of the chosen base? I have not been able to find this information in the usearch manual or in his uparse Nature Methods paper.

Can’t speak on behalf of the real mothur people here, but I guess that it generally makes more sense to contact Robert Edgar directly with UCLUST-, UPARSE-, etc related questions. From my personal experience, he replies quite readily (and generally helpfully) to e-mails. :wink:

In our group, we’ve recently looked into comparing UPARSE to other pipelines, mainly focusing on chimera removal and clustering. The manuscript is still under revision, but the bottom line was that UPARSE generally behaves QUITE differently from all other methods we tested (UCHIME-AL, -CL, -SL, -UCLUST, -CD-HIT). I don’t know about your specific question, really (again: ask R. Edgar directly…), but my impression is that UPARSE does things differently from existing approaches at many steps in the pipeline. And “differently” here does not imply any judgement as to whether it’s “better” or “worse” :wink:

My biggest problems with UPARSE and UCLUST are that there is virtually no benchmarking (fewer OTUs does not equal good), the method is poorly described in the paper, and it is closed source.

Hi, I’ve used the UPARSE pipeline with six different amplicons so far and am pleased with the results (none of these are barcoding genes), and I’d like to play around with some of the tests available in mothur. If anyone has code they use to convert files from the UPARSE formats to the mothur shared file format and is willing to share it, I would greatly appreciate that. I can’t code worth beans and I have several deadlines just dumped in my lap which are “ASAP”, so I don’t really have time to fool around with making alignments of GenBank data for my nonstandard amplicons so I can use the mothur pipeline.

Thanks for any help you can provide,



If you drop me an email (david.waite at auckland.ac.nz) I can take a look for you. I have some scripts for doing this automatically, but they’re quite finicky about the input files so I’ll probably need to adapt them for your input data.

Hi- in trying to find a way to manually make a mothur shared file, I find that my data isn’t readable by mothur. Are these files text files, and what delimiters are allowed? (tabs, spaces, commas?)

They’re just text files, tab separated. The format is described at http://www.mothur.org/wiki/Shared_file.
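
If it helps, here is a minimal Python sketch of the conversion, under a few assumptions: the UPARSE OTU table is tab-separated with one row per OTU and one column per sample (a header line starting with something like "#OTU ID"), and the shared file wants one row per sample with the leading columns label, Group and numOtus. The function name `otutab_to_shared` and the label value "0.03" (chosen here because UPARSE clusters at 97% identity) are made up for illustration, so check the result against the wiki page before relying on it.

```python
def otutab_to_shared(otutab_text, label="0.03"):
    """Transpose an OTU-by-sample table into mothur shared-file text.

    otutab_text: tab-separated text, header row first, one OTU per row
                 (assumed layout of the UPARSE OTU table)
    label:       value written to the 'label' column (an assumption)
    """
    lines = [ln for ln in otutab_text.splitlines() if ln.strip()]
    samples = lines[0].split("\t")[1:]  # skip the "#OTU ID" cell

    otu_names = []
    otu_rows = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        otu_names.append(fields[0])
        otu_rows.append(fields[1:])

    out = ["\t".join(["label", "Group", "numOtus"] + otu_names)]
    for i, sample in enumerate(samples):
        # Collect this sample's count from every OTU row (the transpose).
        counts = [row[i] for row in otu_rows]
        out.append("\t".join([label, sample, str(len(otu_names))] + counts))
    return "\n".join(out) + "\n"


otutab = "#OTU ID\tS1\tS2\nOTU_1\t10\t0\nOTU_2\t3\t7\n"
print(otutab_to_shared(otutab), end="")
```

Writing the returned text to a file with tab delimiters intact (no spaces or commas) should give something mothur can read, provided the assumptions above hold for your table.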

In response to Pat’s comments “there is virtually no benchmarking” and “fewer OTUs does not equal good”. I think “virtually no” benchmarking is a bit unfair. Extensive benchmark results are given in the Nature Methods paper (many of them given in the supp notes only due to space limitations). The main results are summarized here: http://www.drive5.com/usearch/manual/otu_clustering.html.

I totally agree that the number of OTUs by itself is not a useful measure of OTU quality on a mock community, and I designed the validation with this in mind. Instead, I measured accuracy of the representative sequences, and found that a large majority of the OTU sequences generated by UPARSE are exactly correct biological sequences. By that measure, UPARSE was better than other methods which produced 50% or more spurious OTUs (chimeric or >3% diverged from a true biological sequence).

The number of OTUs is less important, and there is no “correct” number even on a mock community. However, the number of OTUs is often used to estimate diversity, and here UPARSE was shown to have an important advantage because the number of OTUs on the mock communities was very stable – the number was very close to the number of species and was almost unchanged on sets of reads ranging from a few thousand 454 reads to a few million MiSeq reads. This was achieved without any parameter tuning for different datasets or any technology-specific steps in the pipeline (e.g., no flowgram denoising). While it’s dubious to extrapolate from mock to real communities, this gives some confidence that a diversity estimate could be informative and could be comparable between different datasets. By contrast, other methods produced much more variable numbers of OTUs, usually many more than the number of species.

Another criticism of UPARSE in this forum is that it is “cheating” by discarding singleton reads. The UPARSE paper shows that if you don’t discard singletons, the number of spurious OTUs blows up. This particular result surely does extrapolate from mock to real, so I would suggest that the burden of proof is on the people who keep singletons to show that they can make good biological inferences from their OTUs. There is no right or wrong answer here – it’s a choice between specificity and sensitivity, like setting a BLAST E-value threshold. If you use the BLAST default E=10, then you may get many spurious hits, and if you keep singleton reads, you may get many spurious OTUs.

The fastq_mergepairs command in usearch calculates the posterior error probability for each base in the overlapping region using Bayes’ Theorem, which gives the Phred score. The method is described in Edgar & Flyvbjerg (2015), doi: 10.1093/bioinformatics/btv401. See also these pages in the usearch manual:
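
Separately from the manual pages, here is a rough Python sketch of the kind of Bayesian calculation described above. It is a simplified model and not taken from the usearch source: it assumes a uniform prior over the four bases, independent errors in the two reads, and that an erroneous call is equally likely to be any of the three wrong bases. The function names are made up for illustration, and the exact formulas usearch uses should be checked against the paper.

```python
import math


def phred_to_prob(q):
    """Convert a Phred score to an error probability."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert an error probability to a Phred score."""
    return -10 * math.log10(p)


def merged_error_prob(px, py, agree):
    """Posterior error probability of the merged base call.

    px, py: error probabilities of the forward and reverse base calls
    agree:  True if both reads call the same base
    Assumes a uniform prior and errors spread evenly over the 3 wrong bases.
    """
    if agree:
        # Both reads say the same base: the posterior error shrinks.
        return (px * py / 3) / (1 - px - py + 4 * px * py / 3)
    # Reads disagree: keep the call with the lower error probability,
    # but its posterior error is larger than its original error.
    lo, hi = min(px, py), max(px, py)
    return lo * (1 - hi / 3) / (lo + hi - 4 * lo * hi / 3)


# Two Q20 calls that agree, and a Q30 call that disagrees with a Q10 call.
p_agree = merged_error_prob(phred_to_prob(20), phred_to_prob(20), agree=True)
p_differ = merged_error_prob(phred_to_prob(30), phred_to_prob(10), agree=False)
print(round(prob_to_phred(p_agree), 1), round(prob_to_phred(p_differ), 1))
```

Under these assumptions the merged score is neither an average nor simply the higher of the two: agreement pushes the quality above either input, while a mismatch pulls the quality of the chosen base below its original value.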