Dealing with sequence errors in alpha diversity calculations

Hi all,

I am analyzing a dataset in which I included a mock community with 20 species. After clustering (using usearch), removing contaminants, etc., I looked at the mock sample and found that it had 19 OTUs with abundances > 400 sequences, 2 OTUs with about 40 reads, and then 88 with 1-18 sequences. Is this common? The majority of these unwanted low-abundance OTUs are abundant in the other samples from this dataset, so I wonder if the problem could be caused by barcode switching. If I can’t rule out that this happened in my “real” samples, then how do I calculate the number of species per sample, for instance? I understand that for alpha diversity measurements I should include everything, including singletons. Does anyone have any advice/recommendations?

Thanks in advance


I would not report absolute diversity metrics; rather, I would report them relative to your other samples. As you are finding, there are a variety of sources of error (including contamination of your mock community from reagents, etc.). By treating everything as “relative”, you can safely assume that everything is equally good/bad.
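
To make the "relative" idea concrete, here is a minimal plain-Python sketch of one way such a comparison could be set up. It is not the usearch workflow from the question, only an illustration. It assumes that samples are first subsampled (rarefied) to a common read depth, and that the observed number of OTUs is the metric of interest. The sample names, OTU labels, and counts are made up, and the helper names (`rarefy`, `observed_otus`, `relative_richness`) are hypothetical.

```python
import random

def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from an OTU -> read count dict."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    sub = {}
    for otu in rng.sample(pool, depth):
        sub[otu] = sub.get(otu, 0) + 1
    return sub

def observed_otus(counts):
    """Number of OTUs with at least one read (singletons included)."""
    return sum(1 for n in counts.values() if n > 0)

def relative_richness(samples, n_iter=100, seed=1):
    """Mean observed OTUs per sample after rarefying all samples to the
    depth of the smallest one (an assumption of this sketch)."""
    rng = random.Random(seed)
    depth = min(sum(c.values()) for c in samples.values())
    mean_rich = {}
    for name, counts in samples.items():
        total = sum(observed_otus(rarefy(counts, depth, rng))
                    for _ in range(n_iter))
        mean_rich[name] = total / n_iter
    return depth, mean_rich

# Made-up counts, for illustration only (not data from this thread).
samples = {
    "mock":   {"Otu01": 500, "Otu02": 450, "Otu03": 40, "Otu04": 3, "Otu05": 1},
    "soil_A": {"Otu01": 10, "Otu02": 300, "Otu06": 400, "Otu07": 200,
               "Otu08": 50, "Otu09": 5},
    "soil_B": {"Otu02": 100, "Otu06": 600, "Otu07": 250, "Otu10": 20},
}

depth, mean_rich = relative_richness(samples)
for name, rich in sorted(mean_rich.items()):
    print(f"{name}: {rich:.1f} observed OTUs at {depth} reads")
```

The numbers that come out are only meaningful in comparison with each other: spurious OTUs from cross-talk or reagent contamination inflate every sample, so the ranking of samples is more defensible than any single absolute count.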


Thanks Pat, appreciated!