Pipeline optimization: total reads vs unique reads


First, thanks for this wonderful software and for keeping it so relevant.

I am working on pipeline optimization and have a mock with 8 bacteria and 2 fungi.

I get about 1200 OTUs and our error rate is 0.08%. I believe the high OTU count is mainly due to the large number of reads going in (around 100-200k). Most reads come from the first 8 or so OTUs.

I saw in your 2013 paper you used 5000 total reads per sample but am worried this may not be enough for complex samples.

Might I use 5000 unique reads per sample instead?

What do you think?


I think people make too much of the number of reads going into a sample. Regardless of how many reads you use, you need to have the same number of total reads (not unique reads) in each sample when you rarefy the data.
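To make the "same number of total reads" point concrete, here is a minimal Python sketch of subsampling every sample to a common depth. It is only an illustration, not the software's own implementation: the function name `subsample_counts`, the sample names, and the OTU counts are all made up for the example.

```python
import random
from collections import Counter

def subsample_counts(otu_counts, depth, seed=0):
    """Draw `depth` reads without replacement from a dict of OTU -> read count.

    Hypothetical helper for illustration only.
    """
    total = sum(otu_counts.values())
    if depth > total:
        raise ValueError(f"sample has only {total} reads, cannot draw {depth}")
    # Expand the table to one label per read, so every *total* read
    # (not every unique sequence) has the same chance of being drawn.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    picked = random.Random(seed).sample(reads, depth)
    return dict(Counter(picked))

# Made-up counts for two samples with different sequencing depths.
samples = {
    "mock_a": {"Otu1": 6000, "Otu2": 3000, "Otu3": 990, "Otu4": 10},
    "mock_b": {"Otu1": 2500, "Otu2": 2000, "Otu3": 500},
}

# Use the smallest sample as the common depth (5000 reads here).
depth = min(sum(counts.values()) for counts in samples.values())
rarefied = {name: subsample_counts(counts, depth) for name, counts in samples.items()}

for name, counts in rarefied.items():
    print(name, sum(counts.values()), "reads,", len(counts), "OTUs")
```

After this step every sample holds the same number of total reads, while the number of OTUs left in each sample is whatever the random draw happens to keep.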




I tried making a rarefaction plot manually with my 120K reads and 1200 OTUs by downsampling the reads and reducing the number of OTUs proportionately.

This did not work because apparently rarefaction plots are made by rarefying unique reads.

They’re made using the total number of sequences, not the number of unique sequences.
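As a rough sketch of what that means, the Python below draws random subsets of the total reads at several depths and counts how many OTUs show up in each draw. It is an illustration under assumptions, not the software's actual code: `rarefaction_curve` is a hypothetical helper, and the community (8 dominant OTUs plus 200 singletons) is invented to loosely resemble a mock community with many rare, spurious OTUs.

```python
import random

def rarefaction_curve(otu_counts, depths, iters=10, seed=0):
    """Return (depth, mean observed OTUs) pairs.

    Hypothetical helper: at each depth, subsample the total reads
    without replacement `iters` times and average the OTU count.
    """
    # One label per read, so sampling is over total reads.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        observed = [len(set(rng.sample(reads, depth))) for _ in range(iters)]
        curve.append((depth, sum(observed) / iters))
    return curve

# Invented community: 8 abundant OTUs and 200 singletons (8200 reads total).
community = {f"Otu{i}": 1000 for i in range(1, 9)}
community.update({f"Otu{i}": 1 for i in range(9, 209)})

total_reads = sum(community.values())
curve = rarefaction_curve(community, depths=[100, 1000, 4000, total_reads])

for depth, mean_otus in curve:
    print(depth, mean_otus)
```

In this toy example the abundant OTUs tend to appear even in small draws while the singletons appear only as depth grows, so the OTU count does not simply scale with the number of reads kept. That is why scaling the OTU count down by hand does not reproduce the curve.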