I would like to hear your opinion about ASVs vs. OTUs. Do you think that by using amplicon sequence variants we can get a better estimate of the true diversity of a community? Or do you think the technology is still not ready for that? What would be the advantage of continuing to use OTUs?
I used dada2 to get an ASV table from some previously analyzed data (mothur, 97%) and the truth is that I got way fewer taxa (ca. 6x fewer: ca. 3000 OTUs vs. 500 ASVs). I can see how representative OTUs may inflate the diversity, but isn't that a bit extreme?
Finally, if you think that maybe we can start working with ASVs instead of OTUs, should we expect an option in mothur soon?
Here is a comment by Fierer et al. on the subject from last year (see link). A brief summary of the pros and cons:
I do not want to speak on behalf of Pat, but you can read his comment:
Hope it helps to start the discussion.
Thanks Vingomez! Hadn't seen that!
I really like Pat’s comments. I also believe in the 97% cutoff, and I strongly concur that there is a limit to what 250bp of 16S can tell us.
My biggest ‘concern’ is assessing the diversity of complex bacterial communities, where you can get thousands of OTUs or hundreds of ASVs. Of course I know that most likely we will not find out in this lifetime what the TRUE diversity there is (unless we genome sequence every single strain separately… good luck with that), but I would like to know the estimate that is closest to the truth. I assume that the reason dada2 gives fewer taxa (ASVs) is that it has a much more aggressive filtering approach (to remove sequencing errors), but surely it also drops a lot of rare OTUs?
Then again, is there a point in dropping the rare OTUs and keeping only the ones that we are certain about (e.g. the ones above 10 reads, perhaps?), in order to create a reduced dataset to compare with?
Anyway, it's all relative I guess, depending on how we choose to analyze our datasets. In the end it's not about the software but about how we interpret the outcome.
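To make the "reduced dataset" idea concrete, here is a minimal Python sketch of an abundance filter. The OTU names, counts and the `drop_rare` helper are all made up for illustration, and the cutoff of 10 total reads is just the number floated above, not a recommendation.

```python
# Toy OTU table: OTU id -> read counts per sample (hypothetical values).
otu_table = {
    "Otu001": [120, 340, 95],
    "Otu002": [4, 0, 3],
    "Otu003": [0, 1, 0],
    "Otu004": [8, 2, 5],
}

def drop_rare(table, min_total_reads=10):
    """Keep only OTUs whose total count across all samples exceeds the cutoff."""
    return {otu: counts for otu, counts in table.items()
            if sum(counts) > min_total_reads}

reduced = drop_rare(otu_table)

# Compare how many OTUs vs. how many reads survive the filter.
reads_before = sum(sum(counts) for counts in otu_table.values())
reads_after = sum(sum(counts) for counts in reduced.values())
print(f"OTUs kept: {len(reduced)} of {len(otu_table)}")
print(f"Reads kept: {reads_after} of {reads_before}")
```

In this toy table half of the OTUs disappear while almost all reads are retained, which is the kind of comparison one would want to look at before deciding whether the reduced table is comparable to an ASV table.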
I would like to revive this topic, after reading the ISME publication by the DADA2 people (https://www.nature.com/articles/ismej2017119). It is my opinion that this does merit some more conceptual attention on the mothur forum. Although the discussion has been started before on this forum (e.g. "OTUs or sequences?" on ASVs and oligotyping, and "OTU classification and minimum entropy decomposition" on MED), I feel that there is no consensus on when to use one or the other.
In the past my experience with ESVs has been limited to oligotyping a specific taxon (on a relatively abundant OTU representing it) among different conditions and sometimes seeing different oligotypes popping up between conditions (i.e. the use suggested by @dwaite in "OTU classification and minimum entropy decomposition").
Nevertheless, recently I was clustering 350 samples of full-overlap V4 data on a fairly powerful machine and again getting absurdly large distance matrices (http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/, using the MiSeq SOP with mothur 1.40.5), and painfully realizing that to have comparability among samples, they always need to be clustered together at once. One of the points that Callahan, McMurdie and Holmes make in their ISME perspective is that with ASVs this is a non-issue, as a given ASV can be inferred on a single-sample basis and compared to any number of other samples (i.e. the grouping is independent of dataset size). I think that this is not entirely fair, because a beta-diversity metric will be inflated if new ASVs occur in new samples (i.e. because the number of zeros will increase in the original set, as the ASVs do not occur there), just as it would with OTUs.
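A small Python sketch of the two ideas in the paragraph above: merging independently inferred per-sample tables by exact sequence, and checking what zero padding does to a pairwise dissimilarity. The sequences, counts and helper names (`merge_tables`, `bray_curtis`) are hypothetical, and Bray-Curtis is only one choice of metric.

```python
def merge_tables(per_sample):
    """Merge per-sample {sequence: count} dicts into one table keyed by exact sequence."""
    all_seqs = sorted({seq for counts in per_sample.values() for seq in counts})
    table = {name: [counts.get(seq, 0) for seq in all_seqs]
             for name, counts in per_sample.items()}
    return all_seqs, table

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))

# Hypothetical ASVs inferred independently for each sample (made-up sequences).
samples = {
    "A": {"ACGTAC": 10, "ACGTTC": 5},
    "B": {"ACGTAC": 5, "ACGTTC": 10},
}
_, table_ab = merge_tables(samples)
bc_before = bray_curtis(table_ab["A"], table_ab["B"])

# A later sample brings in an ASV that is absent from A and B,
# so A and B each gain a zero column in the merged table.
samples["C"] = {"ACGTAC": 3, "TTGTAC": 12}
seqs, table_abc = merge_tables(samples)
bc_after = bray_curtis(table_abc["A"], table_abc["B"])

print(seqs)
print(bc_before, bc_after)
```

In this toy case the A vs. B Bray-Curtis value is the same before and after adding sample C, because a feature that is absent from both samples contributes nothing to that particular metric. Metrics that count shared absences, or ordinations fitted on the whole table, could behave differently, so this does not settle the question in general.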
In all fairness, I have not yet thoroughly evaluated ASVs myself (of course the DADA2 people claim it is amazing, but everyone believes their tool is amazing). I will look at some mock (ZymoBIOMICS community standard) data in the coming weeks with DADA2 and mothur for V3-V4, and hopefully also for full-overlap V4, to get a better idea, but I’d like to have a more conceptual discussion on the value of (97%) OTUs vs. ASVs.
For instance: what is the prevailing opinion on ecological consistency/robustness (e.g. https://www.ncbi.nlm.nih.gov/pubmed/24763141 and https://msphere.asm.org/content/2/2/e00073-17) and biological consistency/interpretation of OTUs vs. ASVs (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5812548/ and the discussion between Schloss and Edgar stated above)? And how well does the assumption that “biological sequences are more likely to be repeatedly observed than are error-containing sequences” (Callahan et al. (2017) ISME) hold, e.g. given the divergence among rRNA operons described in the comment by Fierer et al.?
I agree that it probably doesn’t make sense to use a marker gene study to distinguish pathogenic species/strains from others in the same genus based on ASVs/ESVs (given the resolution of most single-marker genes and the read length of MiSeq). But I do think that the argument that is being made about interoperability/re-usability of ASVs is one that deserves a second look.
As a side-note: with the advent of full-length 16S NGS amplicon sequencing (e.g. on the PacBio platform, https://peerj.com/articles/1869/), with decreasing error rates, would ASVs become more relevant?
My biggest scientific concern with the dada2 implementation of ASVs is the error correction, which is tied to the removal of all singleton sequences. Maybe they’re all sequencing errors? Maybe. Mostly, I think ASVs on Illumina data are pushing the technology beyond its precision point.
I think with reference to my side-note this paper is definitely worth the read: https://www.nature.com/articles/s41467-019-13036-1
“In particular, we caution against the conclusion that quantifying exact sequence variants (ESVs) is preferential to more traditional OTU-based approaches [16]. This conclusion assumes that ESVs represent a more meaningful taxonomic unit than OTUs. Given that the majority of bacterial isolates we sequenced contained multiple, variant copies of the 16S gene within their genome, this assumption may not always be correct. The potential for 16S copy variants to bias estimates of bacterial diversity is well established [27], and we and others [25] have shown the number of unique sequences detected in a mock community is far greater than the number of species known to be present.”
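The point in that quote about multiple, variant 16S copies per genome can be shown with a toy count in Python. Everything here is hypothetical (species names, sequences, copy numbers), and it assumes error-free reads so that every distinct copy would show up as its own exact sequence variant.

```python
# Hypothetical mock community: species -> 16S copies in its genome (made-up sequences).
genomes = {
    "species_1": ["ACGTACGT", "ACGTACGT", "ACGTACGA"],  # two distinct variants
    "species_2": ["TTGCACGT"] * 4,                      # four identical copies
    "species_3": ["GGCTACGT", "GGCTACGA", "GGCAACGT"],  # three distinct variants
}

# With perfect sequencing, each distinct copy is a separate exact sequence variant.
unique_variants = {seq for copies in genomes.values() for seq in copies}
variants_per_species = {sp: len(set(copies)) for sp, copies in genomes.items()}

print(f"Species present: {len(genomes)}")
print(f"Exact sequence variants: {len(unique_variants)}")
print(variants_per_species)
```

Here three species give six exact variants even without any sequencing error, so a variant count read directly as a species count would overestimate richness in this toy community.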
I think length (aka amount of data per sequence) is going to continue to be the biggest issue. I don’t find Jethro et al.’s finding that v4 alone can’t be ID’d to species surprising (this is basically what everyone has been saying). It’s not something special about v4, it’s just too short.
I’m working on longer amplicons with MinION. Who knows if that is going to be any better given the error rate, but it feels worth a try.