Question regarding subsampling

Hello all,

I am currently working with a 16S rRNA gene dataset and would like to use the Schloss SOP for my analysis. However, I am not sure how I should handle the subsampling step, since I have low numbers of sequences for some of my samples (a few samples have only 600 sequences, compared to 7,000 for some others). I understand that I have to subsample to the lowest number of sequences, but I wanted to know whether 600 would be too few. We have 94 samples, so we don’t mind throwing away a few of them if it improves our analysis in the end.

Does anyone know if there is a way of determining the minimum number of sequences we should subsample down to? Any advice would be greatly appreciated.

Thanks in advance!

Kyle

It’s really a judgement call. Make a histogram of the number of reads per sample and look at where the various samples fall on that distribution. You may decide you’re willing to throw out some samples because keeping them isn’t as valuable to you as having more sequence reads from the samples you retain.
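
If it helps, here’s a quick way to get that histogram — just a rough Python sketch, not a mothur command, and it assumes a standard .shared file layout (label, Group, numOtus, then one column per OTU). The file name is made up, so swap in your own:

```python
import csv
import matplotlib.pyplot as plt

# Sum the OTU counts for each sample (group) in a mothur .shared file.
# Assumes the usual layout: label, Group, numOtus, then one column per OTU.
reads_per_sample = {}
with open("final.0.03.shared") as handle:  # hypothetical file name -- use your own
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    for row in reader:
        group = row[1]
        reads_per_sample[group] = sum(int(x) for x in row[3:])

# Histogram of sequences per sample, to see where the small samples fall.
plt.hist(list(reads_per_sample.values()), bins=30)
plt.xlabel("Sequences per sample")
plt.ylabel("Number of samples")
plt.show()
```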

Thanks Pat! I appreciate your input.

Kyle

I thought this was an interesting article: Aguirre de Cárcer et al. 2011. Evaluation of Subsampling-Based Normalization Strategies for Tagged High-Throughput Sequencing Data Sets from Gut Microbiomes. Applied and Environmental Microbiology 77:8795–8798.

Yeah, but if I remember right they advocate subsampling up, which doesn’t make any sense…

They recommend normalization to the median, so not _sub_sampling sensu stricto.

Yeah, that doesn’t make sense either - how do you normalize a zero up?

We have come across a similar question. The article mentioned above suggests taking the median number of sequences and normalizing to that number. We think this approach should lose a lot less information than subsampling to the lowest number of sequences. And samples that have fewer sequences than the median should be ignored, not discarded. This way we normalize samples that may be overestimating diversity and still retain the small samples. Maybe you could add this option to the sub.sample or normalize.shared commands?

> The article mentioned above suggests taking the median number of sequences and normalizing to that number. We think this approach should lose a lot less information than subsampling to the lowest number of sequences. And samples that have fewer sequences than the median should be ignored, not discarded.

I think the difference between “ignored” and “discarded” is a distinction without a difference. Sorry, but the approach still doesn’t make any sense. Why not just subsample to the median if that’s what you want?
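
For what it’s worth, subsampling to a fixed depth is conceptually just this — a minimal sketch in plain Python, not mothur’s actual implementation (sub.sample does this for you). Note that a sample with fewer reads than the chosen depth simply has nothing left to draw from, and the zeros stay zeros:

```python
import random

def subsample_counts(counts, depth):
    """Randomly draw `depth` reads without replacement from a vector of OTU counts.

    A sample with fewer than `depth` reads has nothing to draw from,
    which is why a low-coverage sample can't be "normalized up" to the median.
    """
    if sum(counts) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand to one entry per read, labelled by OTU index, then draw at random.
    reads = [otu for otu, n in enumerate(counts) for _ in range(n)]
    rarefied = [0] * len(counts)
    for otu in random.sample(reads, depth):
        rarefied[otu] += 1
    return rarefied

# Example: a sample with 10 reads rarefied down to 5 reads.
print(subsample_counts([4, 3, 0, 2, 1], 5))
```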

My request was to add an option not to remove any groups that have a smaller number of sequences than the value I select. (The normalize.shared wiki says, “If you set norm greater than an abundance of a specific group the group will be removed.” I don’t want to remove those groups.) Is it possible to add this option to the sub.sample and normalize.shared commands?
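
To make the request concrete, here is a rough, self-contained sketch (plain Python, not mothur) of the behaviour I mean: groups at or above the chosen cutoff get subsampled down to it, and groups below the cutoff are passed through unchanged rather than removed. The table layout and names are just placeholders.

```python
import random

def rarefy_or_keep(table, depth):
    """Subsample each group's OTU counts to `depth`, but keep groups whose
    total is below `depth` as they are instead of removing them.

    `table` maps a group name to its list of OTU counts.
    """
    out = {}
    for group, counts in table.items():
        if sum(counts) < depth:
            out[group] = list(counts)  # too small: pass through untouched
            continue
        reads = [otu for otu, n in enumerate(counts) for _ in range(n)]
        rarefied = [0] * len(counts)
        for otu in random.sample(reads, depth):
            rarefied[otu] += 1
        out[group] = rarefied
    return out

# Example: group "B" has only 4 reads, so it is kept as-is rather than dropped.
print(rarefy_or_keep({"A": [6, 3, 1], "B": [2, 2, 0]}, 5))
```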