Dear mothur users,
I am a new mothur user. I have been using other metagenomics pipelines for my analyses. I would like to know where the calculated sequencing error rate is used in the subsequent steps in mothur. For instance, in QIIME, the error rate calculated from MOCKs is used to perform abundance filtering of the OTU table, so spurious OTUs are significantly reduced or eliminated. In mothur, I do not see where the sequencing error rate is applied once it has been calculated.
Thanks in advance,
Its main use is for knowledge. I would also question whether you could use it the way you describe it in QIIME. Seems like you’d be risking pretty high false positive rates, and it would get messy when different samples have different levels of sampling.
The strategy I employ is to take the proportion of the most abundant non-MOCK bacterium (relative to the total abundance of ALL the bacteria in the MOCK). The OTU table is then subjected to abundance filtering using this threshold (usually expressed as a fraction), which can optionally be used together with a cut-off for eliminating low-abundance OTUs (besides singletons, doubletons, tripletons, …). Thus, any OTU whose proportion (or total observation count) is less than (any of) the threshold(s) is discarded. I know this strategy has its own limitations (depending on the microbiome question that one is answering), but it significantly reduces spurious OTUs. I thought there was a way of performing this in mothur, so that the final OTU table used in the downstream analyses (including diversity analyses) is “clean”.
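To make the filtering step concrete, here is a minimal Python sketch of the idea as described above. It is not a mothur or QIIME command; the function names (`mock_threshold`, `filter_otus`), the dictionary layout of the OTU table, and the example counts are all assumptions made for illustration.

```python
def mock_threshold(mock_counts, expected_otus):
    """Proportion of the most abundant non-mock OTU, relative to all reads in the mock.

    mock_counts: dict mapping OTU label -> read count in the mock sample (assumed layout).
    expected_otus: set of OTU labels known to belong to the mock community.
    """
    total = sum(mock_counts.values())
    unexpected = [n for otu, n in mock_counts.items() if otu not in expected_otus]
    return max(unexpected, default=0) / total


def filter_otus(otu_table, threshold, min_count=0):
    """Keep OTUs whose overall proportion and total count both reach the cut-offs.

    otu_table: dict mapping OTU label -> dict of sample name -> read count.
    An OTU is discarded if its proportion is below `threshold`
    or its total count is below `min_count`.
    """
    grand_total = sum(sum(samples.values()) for samples in otu_table.values())
    kept = {}
    for otu, samples in otu_table.items():
        n = sum(samples.values())
        if n / grand_total >= threshold and n >= min_count:
            kept[otu] = samples
    return kept


# Hypothetical mock sample: Otu3 and Otu4 are not part of the mock community.
mock = {"Otu1": 500, "Otu2": 480, "Otu3": 15, "Otu4": 5}
threshold = mock_threshold(mock, expected_otus={"Otu1", "Otu2"})

# Hypothetical OTU table for the real samples.
table = {
    "OtuA": {"s1": 900, "s2": 800},
    "OtuB": {"s1": 10, "s2": 5},
    "OtuC": {"s1": 150, "s2": 135},
}
filtered = filter_otus(table, threshold, min_count=4)
print(threshold, sorted(filtered))
```

Note that this sketch applies the threshold to proportions pooled over all samples; applying it per sample would be a different design choice and would behave differently when sampling depth varies between samples.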
Hrrrmmm. The problem with this, IMHO, is that a mock is designed for the populations to all be within an order of magnitude of each other. So using the abundance of the most abundant error would probably remove a lot of good sequences.
Thanks for your constructive thoughts. You are right, but erroneous discarding of good sequences mostly happens if the PCR and sequencing steps were problematic. I have previously done this abundance filtering with success. If the error rates are low, then I do not have to worry much. Do you have any “abundance-filtering” recommendations? I wish to make use of the sequencing error rate to minimize or eliminate “exaggerated” microbial diversity, and thus minimize or eliminate false discoveries.
I would question how you know you have done the filtering “with success”. What kind of control did you have and how did you show/define if it was successful?
My recommendation is to not do any abundance filtering for alpha and beta diversity analyses. Instead, use rarefaction to compare all samples on an equal footing. There may be errors in individual samples, but by rarefaction, you assure that they’re all equally treated.
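As a rough illustration of what "equal footing" means here, the following Python sketch subsamples every sample, without replacement, down to the depth of the shallowest sample. It is an assumption-laden toy, not mothur's implementation: the names `subsample` and `rarefy_table` and the example counts are hypothetical, and it performs a single random draw, whereas rarefaction normally averages a metric over many such draws.

```python
import random
from collections import Counter


def subsample(sample_counts, depth, seed=None):
    """Draw `depth` reads without replacement from one sample.

    sample_counts: dict mapping OTU label -> read count (assumed layout).
    Returns None if the sample has fewer than `depth` reads.
    """
    rng = random.Random(seed)
    # Expand the counts into one entry per read, then draw from that pool.
    reads = [otu for otu, n in sample_counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    return dict(Counter(rng.sample(reads, depth)))


def rarefy_table(table, seed=0):
    """Subsample every sample to the depth of the shallowest sample."""
    depth = min(sum(counts.values()) for counts in table.values())
    return {name: subsample(counts, depth, seed) for name, counts in table.items()}


# Hypothetical samples with very different sequencing depths.
table = {
    "s1": {"Otu1": 60, "Otu2": 40},
    "s2": {"Otu1": 300, "Otu2": 150, "Otu3": 50},
}
rarefied = rarefy_table(table)
for name, counts in rarefied.items():
    print(name, sum(counts.values()))
```

After this step each sample contributes the same number of reads, so any sequencing errors that remain are given the same opportunity to appear in every sample.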
Thank you for your suggestion(s). There is an unpublished study titled “Benchmarking 16S rRNA gene sequencing and bioinformatics tools for identification of microbial abundances” where we validated 16S rRNA amplicon sequencing and bioinformatics tools (mothur, QIIME, “QUPARSE” [QIIME with UPARSE] and “RiboPicker”) for microbiome analyses. We used 16 HMP mocks (8 even and 8 staggered), each of which had 20 bacteria of known identity and copy number. I’ll only mention a few findings here… All four pipelines gave similar distributions of the 20 bacteria. There were significant differences in the false-positively assigned reads and OTUs (at genus level). mothur and QUPARSE (the pipeline I mostly use) had similar and significantly lower numbers of false negative genera (at different thresholds on relative abundances), and of false positive reads and genera, than QIIME and RiboPicker (p-value < 0.001). mothur outperformed QUPARSE in terms of the reads that were finally retained: QUPARSE retained about 10% fewer reads than mothur. This is because I had stringently optimized QUPARSE for high-quality reads, and probably also because two different databases were used.