Alpha diversity after removing low-abundance OTUs

Hi everyone,

I just got my first MiSeq dataset, for which I amplified the V3-V4 region of the 16S rRNA gene (2x250 bp). After clustering, I obtain a lot of OTUs (up to 16,000 for one sample).

I am a novice, and my question is whether, after normalization (i.e. subsampling to the same number of sequences for all samples), I should remove the singletons and/or the low-abundance OTUs (< 0.005%). I made the rarefaction curves with and without the removal of singletons.
When I keep the singletons, the rarefaction curves do not reach the asymptote, whereas when I remove the low-abundance OTUs they do.

Is it a problem if the curves do not reach the asymptote?

I have another question: which calculators are available to estimate alpha diversity once the low-abundance OTUs have been removed? If I understand correctly, the Chao1 estimator relies on the singletons and doubletons, so I can’t use it in this case.
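
To make the dependence on rare OTUs concrete, here is a minimal Python sketch of Chao1. It assumes the bias-corrected form S_obs + F1(F1-1) / (2(F2+1)), where F1 and F2 are the numbers of singleton and doubleton OTUs; the OTU counts are invented for illustration.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate for one sample.

    counts: per-OTU read counts (toy data here, not from the thread).
    """
    present = [c for c in counts if c > 0]
    s_obs = len(present)                      # observed richness
    f1 = sum(1 for c in present if c == 1)    # singletons
    f2 = sum(1 for c in present if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


otu_counts = [50, 30, 10, 2, 2, 1, 1, 1, 1]
with_singletons = chao1(otu_counts)                               # 11.0
without_singletons = chao1([c for c in otu_counts if c > 1])      # 5.0
print(with_singletons, without_singletons)
```

Once the singletons are gone, F1 is zero and the estimate collapses to the observed richness, so the estimator no longer estimates anything beyond what was counted.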

Can anyone help me?

All answers will be appreciated!


I think the richness estimators are of little use for microbial communities. I used to look at them back when I was doing clone libraries, but when I moved to 454 I stopped. I don’t remove singletons and doubletons; I think that’s a pretty blunt instrument for trying to improve data quality. Given how incompletely we sample the communities, there is no reason to believe that singletons are bad sequences if their quality scores are good (in natural communities, we are likely sampling organisms that really are incredibly rare, in contrast to the relatively abundant organisms that we have in mock communities). Mothur subsamples repeatedly when calculating alpha and beta diversity (YEAH!), so the influence of the very rare sequences is likely mostly removed simply by subsampling.

I much prefer to use an alpha diversity index (inverse Simpson is my personal fav) rather than just richness. When I do look at richness, I just report richness detected rather than estimated richness. I don’t think there’s good evidence that any of the richness estimators increase our understanding of the communities and they introduce potential bias in how the numbers are calculated.
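
As a rough illustration of the difference between an index and plain richness, here is a minimal Python sketch. It assumes the form 1/D with D = sum n_i(n_i-1) / (N(N-1)), which is one common definition of the inverse Simpson index (check which form your software uses); the counts are toy data.

```python
def inverse_simpson(counts):
    """Inverse Simpson index, 1/D, with D = sum n_i(n_i-1) / (N(N-1))."""
    present = [c for c in counts if c > 0]
    n_total = sum(present)
    d = sum(c * (c - 1) for c in present) / (n_total * (n_total - 1))
    return 1.0 / d


counts = [50, 30, 10, 2, 2, 1, 1, 1, 1]       # toy sample with 4 singletons
trimmed = [c for c in counts if c > 1]        # same sample, singletons removed

print(len(counts), len(trimmed))              # observed richness: 9 vs 5
print(inverse_simpson(counts), inverse_simpson(trimmed))  # about 2.78 vs 2.56
```

In this toy example, dropping the singletons nearly halves the observed richness while the inverse Simpson value moves only slightly, because the index is dominated by the abundant OTUs.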

I also stopped looking at rarefaction curves a long time ago; as a practical matter, it’s hard to look at hundreds of curves and get anything meaningful out of them. I calculate Good’s coverage, but I use it mostly to see whether similar samples have similar coverage (a technology check rather than an attempt to understand the community).
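
For readers who have not met it, Good’s coverage is usually computed as C = 1 - F1/N, where F1 is the number of singleton OTUs and N the total number of reads in the sample. A minimal Python sketch with invented counts:

```python
def goods_coverage(counts):
    """Good's coverage: 1 - (number of singleton OTUs / total reads)."""
    n_total = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return 1.0 - singletons / n_total


sample = [50, 30, 10, 2, 2, 1, 1, 1, 1]   # toy sample: 98 reads, 4 singletons
print(goods_coverage(sample))             # 1 - 4/98, about 0.96
```

Note that removing singletons before computing it forces the value to 1.0, so the number is only informative on the unfiltered counts.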

Thank you very much for your answer, but I still have some questions.

Singletons and doubletons are frequently associated with chimeras, and there are also multiple copies of the 16S rRNA gene, even more so once we make amplicons by PCR. So I would have thought that we must remove the low-abundance OTUs when the rarefaction curve does not reach the asymptote, because this means that we do not have enough sequences.

So, in your view, is it a problem if the rarefaction curve does not reach the asymptote, or if I have a lot of OTUs (and many singletons)? My Good’s coverage values are between 0.70 and 0.80; is that good?


You should run a chimera check to remove chimeras, just like you should use the quality scores to remove poor sequences. Treating all singletons as suspect implies that abundant sequences aren’t suspect, but if a chimera forms in the first few cycles of PCR you will have many seqs for that chimera. You can do some back-of-the-envelope calculations for how much of your original sample you are theoretically sampling when you extract DNA, then use a small fraction of that for PCR, then use a small fraction of that for sequencing, so coming up with singletons shouldn’t be that surprising. Of course, lots of people toss them, so you’ll not have a problem publishing if you toss; I just don’t think it’s justified.

What are your samples? I’d usually get <0.7 Good’s coverage for soil samples, but for fecal samples it’s usually >0.9.

Yes, I agree with you; I have run a de novo chimera check to remove chimeras.
My samples come from soil and plant stems.

What disturbs me most is the very large number of OTUs obtained. I did not find many examples in the literature with so many OTUs; generally it is on the order of a hundred OTUs. I am posting the rarefaction curves. What do you think about them?

Another big source of inflated OTU numbers is that you are using the V3-V4 region with 250 PE sequencing. This means that your reads do not fully overlap and that you are going to get an error rate 10-fold higher than you would get with the V4 region. The result? A lot of extra OTUs. We go over this in the Kozich manuscript if you are looking for specifics.

I agree with the earlier posters about removing singletons and doubletons. Furthermore, if your samples don’t have the exact same number of reads (they never do), then one sample that has 1000 reads and another with 10000 will be treated differently if you remove singletons and doubletons. Finally, FWIW, we see some chimeras showing up dozens of times; they are not random artifacts.
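
The sequencing-depth point can be put into numbers with a minimal Python sketch. It assumes reads are drawn independently, so the count of an OTU is binomial, and it uses a hypothetical OTU with a true relative abundance of 0.1%; neither assumption comes from real data.

```python
from math import comb


def prob_survives(depth, rel_abund, min_count=3):
    """P(an OTU gets at least min_count reads) under a binomial model.

    With min_count=3, this is the chance the OTU is detected and is NOT
    thrown out as a singleton or doubleton.
    """
    p_below = sum(
        comb(depth, k) * rel_abund**k * (1 - rel_abund) ** (depth - k)
        for k in range(min_count)
    )
    return 1.0 - p_below


rel_abund = 0.001                            # hypothetical OTU at 0.1%
shallow = prob_survives(1000, rel_abund)     # roughly 0.08
deep = prob_survives(10000, rel_abund)       # above 0.99
print(shallow, deep)
```

Under these assumptions the same organism is almost always discarded from the 1000-read sample and almost always kept in the 10000-read sample, which is the unequal treatment described above.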


Hi Pat,

Thank you for your reply.

Yes, I read the Kozich manuscript and saw that, but in my pipeline I set the insert parameter to 26 during make.contigs, and in the next step, the screen.seqs command, I set the maxambig parameter to 0. So I would have thought that the sequences with a bad overlap were removed. Is this not necessarily the case? How can I be more stringent?

I plotted the length distribution of my sequences after these steps and obtained three length clusters at 436, 456 and 461 bp. It is exactly the same length distribution as in the databases (SILVA and Greengenes), which reassured me even more.

What do you think about it?



My last postdoc was looking at forest soils with V1-V3 (454). I had ~257k OTUs from ~700 samples. Unless your soils are sterilized prior to planting, I’d be very suspicious of the 100 OTUs in soil.

I think there was a small misunderstanding. For the soil samples I have ~12K OTUs per sample (16 samples in total), and for the stem samples I have ~5K OTUs per sample (14 samples in total). However, I think I have too many OTUs, as Pat said, and this may be due to a bad overlap.

It’s not exactly the same question, but… even at the risk of underestimating the richness and diversity I could capture with sequencing, I’d like to remove low-abundance OTUs before starting the analysis, even for alpha diversity. I know many people do not agree with this (others do), but if anyone can indulge me and help me find out how I could do this, so that I can see the output and compare it with other results…

  • I have the final.shared and final.taxonomy files, as well as the final.count_table and final.list files.
    -I want to continue working with true abundances (not relative abundances) as input for summary.single and rarefaction.single (and as far as I understand, these two should be run on the final shared file and not the subsampled shared file; is that true?)
  • How do I remove OTUs with a relative abundance <0.005% and still have a filtered shared file with the original counts rather than relative abundances?
    -And I guess I would have to run classify.otu again after filtering the shared file?
    Thank you!
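
One way to see what such a filter would do, outside mothur, is the minimal Python sketch below. It is an illustration under stated assumptions, not mothur’s own method: it assumes the shared file is tab-delimited with the three leading columns label, Group and numOtus followed by one column per OTU, that it contains a single label, and that "relative abundance" means an OTU’s share of all reads summed over all samples. The function name `filter_shared` and the toy table are hypothetical.

```python
def filter_shared(text, min_fraction=0.00005):
    """Drop OTU columns whose share of all reads is below min_fraction.

    Raw counts of the remaining OTUs are kept unchanged; only the numOtus
    column is updated. 0.00005 corresponds to 0.005%.
    """
    rows = [line.split("\t") for line in text.strip().splitlines()]
    header, samples = rows[0], rows[1:]
    n_fixed = 3  # assumed leading columns: label, Group, numOtus

    # Total reads per OTU across all samples, and overall.
    totals = [sum(int(s[i]) for s in samples) for i in range(n_fixed, len(header))]
    grand_total = sum(totals)
    keep = [i + n_fixed for i, t in enumerate(totals) if t / grand_total >= min_fraction]

    out = ["\t".join(header[:n_fixed] + [header[i] for i in keep])]
    for s in samples:
        out.append("\t".join([s[0], s[1], str(len(keep))] + [s[i] for i in keep]))
    return "\n".join(out) + "\n"


toy = (
    "label\tGroup\tnumOtus\tOtu1\tOtu2\tOtu3\n"
    "0.03\tsoil1\t3\t900\t95\t1\n"
    "0.03\tstem1\t3\t800\t200\t0\n"
)
# A much larger threshold (1%) is used here only so the toy table shows an effect.
filtered = filter_shared(toy, min_fraction=0.01)
print(filtered)
```

On a real file you would read final.shared into a string, call the function with the default threshold, and write the result to a new file. The taxonomy file would need the same OTUs removed, and the result depends on whether the threshold is applied per sample or across the whole dataset, so it is worth stating which definition was used.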