I have been using Mothur for a few years now to analyze V4 and V9 18S rRNA data for metazoan zooplankton communities. Current trends in the metabarcoding field are shifting towards analyses of Exact Sequence Variants (ESVs) instead of clustering sequences into OTUs at similarity thresholds. I was wondering if the “unique sequences” that Mothur identifies after quality control, filtering, and chimera detection are analogous to the ESVs generated by pipelines such as DADA2 (Callahan et al., 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13(7):581-587. doi:10.1038/nmeth.3869). It seems that DADA2 runs sequences through more stringent quality control to detect PCR errors, reduce false positives in taxonomic identifications, and keep error rates low.
I have been classifying the unique sequences identified after running my Illumina data through the Mothur pipeline against a proprietary DNA reference database to get species-level identifications for zooplankton. I’m curious to know your thoughts on whether “unique seqs” would be accepted as ESVs by the scientific community, or whether I would have to use the DADA2 pipeline for this purpose.
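To make concrete what I mean by “unique sequences”, here is a minimal Python sketch of exact-match dereplication. This is only an illustration of the concept, not Mothur's actual implementation; the function name `dereplicate` and the toy reads are made up for the example. The point is that the grouping involves no error model at all: two reads are merged only if they are identical.

```python
from collections import Counter

def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundance counts.

    Hypothetical helper for illustration; no error model is applied,
    so a read with a single mismatch stays a separate unique sequence.
    """
    counts = Counter(reads)
    # Sort by decreasing abundance, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy reads (made up): three identical reads plus two one-off reads.
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA"]
uniques = dereplicate(reads)
print(uniques)
```

Here "ACGA" differs from "ACGT" by one base and is kept as its own unique sequence, whether it is a real variant or a sequencing error.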
DADA2 doesn’t just quality filter; it models the errors (somehow, I don’t understand how), which includes modeling all rare sequences as errors and throwing them out. 1) I don’t think all rare sequences are errors, 2) I don’t like black boxes that I don’t understand, and 3) I think trying to get single-nucleotide resolution from Illumina amplicon data is expecting more precision than the technology can offer. So I stick with mothur and OTUs.
Currently people seem to be equating DADA2 with ASVs, so I’m not sure how they’d take unique seqs from mothur as ASVs.
In our experiments, mothur and DADA2 are equally “stringent”. In fact, we see the same sequencing error rates with both methods (compare the output from pre.cluster to the DADA2 output) if all singletons are removed. I think removing singletons is a horrible idea, since you are significantly changing the structure of the community. Furthermore, there are some significant problems with DADA2, including the need to fit hundreds of parameters, which leads to an overfit model, and the use of corrected P-values that are absurdly small. The end result is a real risk of actually lumping together sequences that should not be lumped together.
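As a rough illustration of the singleton point, here is a small Python sketch that compares richness and Shannon diversity before and after dropping singletons. The abundance vector is made up for the example, and the helper names `shannon` and `drop_singletons` are hypothetical; it is not output from mothur or DADA2.

```python
import math

def shannon(counts):
    """Shannon diversity index (natural log) for a list of abundances."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def drop_singletons(counts):
    """Remove every sequence that was observed exactly once."""
    return [c for c in counts if c > 1]

# Hypothetical community: four abundant sequences and six singletons.
counts = [500, 300, 100, 50, 1, 1, 1, 1, 1, 1]
filtered = drop_singletons(counts)

richness_before = len(counts)
richness_after = len(filtered)

print("richness:", richness_before, "->", richness_after)
print("shannon: ", round(shannon(counts), 3), "->", round(shannon(filtered), 3))
```

In this toy case the singletons make up well under 1% of the reads, yet removing them cuts observed richness from 10 to 4 and lowers the Shannon index, so any richness-based comparison between samples would shift.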