I used mothur to analyze my 16S rRNA and ITS2 datasets from soil samples. The mothur team, especially Sarah, was kind enough to guide me throughout my analysis. I have submitted my manuscript to a journal and received comments from the reviewer. One of the comments concerns the number of OTUs; the reviewer's statement is given below:
“There also seems to be problems with the quality of the bacterial reads- the fact that the rarified to 3,500 when they had 75K + reads. So only being able to identify a few hundred OTUs of bacteria is very low- my experience is that you should get 5000+ OTUs”.
When I normalized the data to my smallest sample, which has about 3500 sequences, the number of OTUs was very low. Even without normalization, there was not a big difference in the number of OTUs. The reviewer's concern is why, with 75K+ reads, the number of OTUs is only in the hundreds. Moreover, the Chao1 estimate is almost double the observed number of OTUs. In contrast, the ITS2 dataset gave more OTUs than the 16S dataset. Please help me reply to the reviewer's question about why we found far fewer OTUs than his rough estimate of 5000+ OTUs for the same soil samples.
A quick response would be greatly appreciated.
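For readers unfamiliar with the Chao1 estimate mentioned in the question, here is a minimal Python sketch of the bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and exactly twice. The function name and the example counts are illustrative assumptions, not output from this dataset, and this is not mothur's own code.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts.

    otu_counts: iterable of non-negative integers, one per OTU.
    """
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                      # observed OTUs
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample: 8 observed OTUs, 4 singletons, 2 doubletons.
example = [10, 5, 2, 2, 1, 1, 1, 1]
print(chao1(example))  # 8 + 4*3 / (2*3) = 10.0
```

A Chao1 value well above the observed count, as described above, simply reflects a large share of singleton OTUs relative to doubletons.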
How many samples had 3500 reads? How many OTUs do you get if you subsample to 10000 (dropping the samples that don’t have that many)? I agree that a few hundred OTUs in a soil sample is low, but I don’t think the subsampling level would change the number of OTUs very much.
I have 18 samples in total, of which two contain approximately 3500 sequences (~830 OTUs), one has 3800, three have 4500, and one has > 11000 sequences. The sample with > 11000 sequences yielded 1347 OTUs, while the highest number of OTUs (1431) was in the sample with 9974 sequences. These OTU counts are without normalization.
Thank you @pschloss.
I think these numbers are lower but accurate because mothur uses a single-linkage preclustering algorithm to cluster the reads, so it estimates actual OTUs rather than spurious ones. Am I right? Should I reply to the reviewer in this way?
The sequence processing pipeline was previously vetted by Kozich et al. (citation) and, using mock communities, was shown to result in an error rate (<0.02%) that is an order of magnitude lower than that of any other pipeline in use. It is possible that claims of thousands of OTUs are inflated because those counts were obtained using other methods with higher error rates, resulting in many more spurious OTUs. Furthermore, it is impossible to get an even number of sequences per sample, so it is critical that we control for uneven sampling by using rarefaction or subsampling. With this in mind, we rarefied our samples to 3500 reads per sample. Clearly, there is no way that we could find 5000+ OTUs from 3500 reads. Because the number of OTUs observed is affected by the number of sequences sampled, it is not possible to directly compare the number of OTUs we observed to the number the reviewer observed. It is only possible to compare the number of OTUs in two communities when the same methods and the same number of sequences are considered.
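To make the subsampling argument concrete, here is a rough Python sketch of drawing a fixed number of reads without replacement from one sample and counting the OTUs that survive. It is only an illustration of the idea, not mothur's implementation; the function name, the seed parameter, and the toy counts are assumptions.

```python
import random


def subsample_otus(otu_counts, depth, seed=0):
    """Draw `depth` reads without replacement and count distinct OTUs.

    otu_counts: list of per-OTU read counts for one sample.
    """
    # Expand the count table into one entry per read, labeled by OTU index.
    reads = [otu for otu, n in enumerate(otu_counts) for _ in range(n)]
    if depth > len(reads):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    picked = rng.sample(reads, depth)
    return len(set(picked))


# Toy sample (hypothetical): 6 OTUs, 25 reads in total.
toy = [12, 6, 3, 2, 1, 1]
print(subsample_otus(toy, 25))  # full depth, so all 6 OTUs are seen
print(subsample_otus(toy, 4))   # can never exceed 4 OTUs
```

The second call shows the bound used in the reply: the number of OTUs observed can never exceed the number of reads drawn.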
Thank you very much for your detailed response, @pschloss. Once again, I am very thankful to you and your team, especially Sarah, for the quick responses and the friendly conversations about mothur, which make it easier for a newbie to use.
I think the reviewer's point is why you chose to rarefy from 75K+ reads to 3,500 reads. For regular soil samples, a couple hundred OTUs after the mothur pipeline is actually very low in my experience, and I don’t think rarefying to 3,500 reads is a good idea. I would guess that 3,500 is the minimum read depth across all of your samples. What I would do is look at the read depth distribution and decide on the best threshold for rarefaction. In that case, you may have to discard a couple of samples. But as long as you have enough replicates, you can definitely make this trade-off (i.e., raising the rarefaction threshold by discarding a couple of samples).
PS: After looking at the read depth across all of your samples, I think you just have relatively low sequencing depth, so the trade-off I mentioned above won’t help much. I usually get far fewer ITS2 OTUs than 16S OTUs. Comparing the sequencing depth of your ITS2 and 16S datasets may give you some idea. It is also possible that, for some reason, the bacterial diversity in your soil samples is actually very low. Plotting a rarefaction curve would help you infer whether you have enough sequencing depth for your 16S data.
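As a rough illustration of the rarefaction curve suggested above, the expected number of OTUs in a random subsample of n reads can be computed analytically as the sum over OTUs of 1 - C(N - N_i, n) / C(N, n), where N is the total read count and N_i the count of OTU i. The Python sketch below assumes a per-sample list of OTU counts; the function name and the toy data are hypothetical.

```python
from math import comb


def expected_otus(otu_counts, n):
    """Expected number of OTUs in a random subsample of n reads."""
    total = sum(otu_counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when n > total - c, i.e. the OTU is certain to appear.
    return sum(1 - comb(total - c, n) / denom for c in otu_counts if c > 0)


# Toy sample (hypothetical): 6 OTUs, 25 reads in total.
toy = [12, 6, 3, 2, 1, 1]
curve = [(n, round(expected_otus(toy, n), 2)) for n in range(0, 26, 5)]
print(curve)
```

If the curve is still climbing steeply at the chosen depth, more sequencing would likely reveal more OTUs; if it has flattened, the depth is probably adequate for that sample.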
Just remember that people were getting Science/Nature papers with < 500 sequences per sample 10 years ago. Read depth isn’t as big an issue as people make it out to be. Just calibrate your conclusions accordingly and you should be fine.
To add my 5 cents, the number of OTUs will obviously also depend on the cutoff (“label”). The high number of OTUs the reviewer is looking for may come from their using a 1% cutoff (99% sequence similarity), as proposed by Edgar RC (Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics. 2018 Jul 15;34(14):2371-2375).