I am currently working with a 16S rRNA gene dataset and would like to use the Schloss SOP for my analysis. However, I am not sure how to handle the subsampling step, because some of my samples have very few sequences (a few have only 600, compared to 7000 for some others). I understand that I have to subsample to the lowest number of sequences, but wanted to know whether 600 would be too few. We have 94 samples, so we don’t mind throwing away a few of them if it improves our analysis in the end.
Does anyone know if there is a way of determining the minimum number of sequences we should subsample down to? Any advice would be greatly appreciated.
It’s really a judgement call. Make a histogram of the number of reads per sample and look at where the various samples fall on that distribution. You may be willing to throw out some samples if keeping them matters less to you than having more sequence reads for the remaining samples.
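To make that judgement concrete, here is a minimal Python sketch that bins per-sample read counts and reports how many samples survive at a few candidate depths. The sample names, counts, bin width, and both function names are made up for illustration; in practice you would fill `counts` from whatever per-sample read summary you already have.

```python
from collections import Counter

def reads_histogram(counts, bin_width=1000):
    """Bin per-sample read counts; returns {bin_start: number_of_samples}."""
    bins = Counter((n // bin_width) * bin_width for n in counts.values())
    return dict(sorted(bins.items()))

def samples_retained(counts, depth):
    """Names of samples with at least `depth` reads (they survive subsampling to `depth`)."""
    return sorted(name for name, n in counts.items() if n >= depth)

# Hypothetical read counts per sample (not real data).
counts = {"S1": 600, "S2": 650, "S3": 2400, "S4": 3100, "S5": 5200, "S6": 7000}

# Text histogram: one '#' per sample in each bin.
bin_width = 1000
for start, n in reads_histogram(counts, bin_width).items():
    print(f"{start:>5}-{start + bin_width - 1:<5} {'#' * n}")

# Trade-off: deeper subsampling keeps more reads per sample but fewer samples.
for depth in (600, 2400, 5200):
    kept = samples_retained(counts, depth)
    print(f"depth {depth}: keep {len(kept)} of {len(counts)} samples")
```

A gap in the histogram between a handful of shallow samples and the rest is usually the natural place to consider a cutoff.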
I thought this was an interesting article: Aguirre de Cárcer et al. (2011), “Evaluation of Subsampling-Based Normalization Strategies for Tagged High-Throughput Sequencing Data Sets from Gut Microbiomes,” Applied and Environmental Microbiology 77, 8795-8798.
We have come across a similar question. The article mentioned above suggests taking the median number of sequences and normalizing to this number. We think that this approach should lose a lot less information than subsampling to the lowest number of sequences. And samples that have fewer sequences than this median should be ignored, not discarded. This way we normalize the samples that may be overestimating diversity and still retain the small samples. Maybe you could add this option to the sub.sample or normalize.shared commands?
The article mentioned above suggests taking the median number of sequences and normalizing to this number. We think that this approach should lose a lot less information than subsampling to the lowest number of sequences. And samples that have fewer sequences than this median should be ignored, not discarded.
I think the difference between “ignored” and “not discarded” is a distinction without a difference. Sorry, but the approach still doesn’t make any sense. Why not just subsample to the median if that’s what you want?
My request was to add an option that does not remove groups with fewer sequences than the number I select. (The normalize.shared wiki says, “If you set norm greater than an abundance of a specific group the group will be removed.” I don’t want to remove those groups.) Is it possible to add this option to the sub.sample and normalize.shared commands?
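For illustration only, the behaviour being requested could be sketched in Python as below. This is not how sub.sample or normalize.shared work; the function name `subsample_keep_small`, the nested-dict layout, and the OTU counts are all hypothetical. Samples above the chosen depth are randomly subsampled without replacement, and samples at or below it are passed through unchanged.

```python
import random
from collections import Counter
from statistics import median

def subsample_keep_small(shared, depth, seed=0):
    """Subsample samples above `depth` reads; leave smaller samples untouched."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    out = {}
    for sample, otus in shared.items():
        total = sum(otus.values())
        if total <= depth:
            out[sample] = dict(otus)  # kept as-is rather than removed
            continue
        # Expand counts into one entry per read, then draw without replacement.
        reads = [otu for otu, n in otus.items() for _ in range(n)]
        out[sample] = dict(Counter(rng.sample(reads, depth)))
    return out

# Hypothetical OTU table: sample -> {OTU label: read count}.
shared = {
    "A": {"Otu1": 400, "Otu2": 200},
    "B": {"Otu1": 1500, "Otu2": 900, "Otu3": 600},
    "C": {"Otu1": 4000, "Otu2": 2000, "Otu3": 1000},
}

# Use the median per-sample total as the target depth.
depth = int(median(sum(otus.values()) for otus in shared.values()))
result = subsample_keep_small(shared, depth)

for sample, otus in result.items():
    print(sample, sum(otus.values()), otus)
```

Note that the output samples end up at unequal depths, so any metric that is sensitive to sequencing depth would still not be directly comparable between the small samples and the subsampled ones.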