tips on subsampling, feature request?

The SOP suggests sub-sampling to normalize the number of sequences between samples.
Depending on the samples, the number of sequences can vary wildly (all the more reason to normalize, of course), but do you have any suggestions on the level at which I should discard the smallest samples? I understand there should be no problem whether the smallest sample is 10 or 100x smaller than the largest, as long as that small number of sequences subsampled from the largest one is still a representative sample. Are there any good, quick ways to estimate that? Maybe there could even be a built-in feature in mothur to estimate whether the smallest sample is large enough to be representative of the others, so users can decide whether to subsample to that number or throw the sample out altogether.
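
For reference, the kind of quick check I have in mind is something like Good's coverage (1 minus the fraction of reads belonging to singleton OTUs) per sample; if I remember right, mothur's summary.single already reports this via its coverage calculator. A rough sketch of the same idea outside mothur, with made-up counts, just to show what I mean:

```python
# Rough sketch (not an existing mothur feature): Good's coverage per sample,
# computed from a hypothetical OTU count table, as one quick way to judge
# whether the smallest sample still looks representative.
def goods_coverage(otu_counts):
    """Good's coverage = 1 - (number of singleton OTUs / total reads)."""
    total_reads = sum(otu_counts)
    if total_reads == 0:
        return 0.0
    singletons = sum(1 for count in otu_counts if count == 1)
    return 1.0 - singletons / total_reads

# Made-up per-OTU read counts for two samples of very different depth.
counts = {
    "deep_sample":    [520, 310, 120, 44, 9, 3, 1, 1],
    "shallow_sample": [38, 21, 6, 2, 1, 1, 1],
}

for name, otu_counts in counts.items():
    print(f"{name}: {sum(otu_counts)} reads, "
          f"Good's coverage = {goods_coverage(otu_counts):.3f}")
```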

On the other hand, Cárcer, Denman, McSweeney, & Morrison, 2011 (http://www.ncbi.nlm.nih.gov/pubmed/21984239) have suggested that sub-sampling to the median might be an improvement over the subsample-to-minimum strategy. Any ideas about that, or about how such subsampling might feasibly be done in mothur?
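
For context on the mechanics: subsampling to any fixed depth is just drawing reads without replacement, and the wrinkle with a median target is that samples below the median cannot be subsampled that way. A minimal sketch of the plain subsample-to-minimum strategy (counts are made up; numpy only does the drawing):

```python
# Minimal sketch of rarefying every sample down to the smallest library size
# by drawing reads without replacement. Counts are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the draw is reproducible

table = {
    "big_sample":   [500, 300, 120, 60, 15, 5],
    "small_sample": [40, 25, 10, 4, 1],
}

depth = min(sum(counts) for counts in table.values())  # subsample-to-minimum

rarefied = {}
for name, otu_counts in table.items():
    # Expand counts into one OTU label per read, draw `depth` reads, re-tally.
    reads = np.repeat(np.arange(len(otu_counts)), otu_counts)
    drawn = rng.choice(reads, size=depth, replace=False)
    rarefied[name] = np.bincount(drawn, minlength=len(otu_counts))

for name, new_counts in rarefied.items():
    print(name, new_counts.tolist(), "->", int(new_counts.sum()), "reads")
```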

Sorry, but subsampling to the median makes zero sense to me: for half the samples you’d be making up data. I might be missing something, but it just doesn’t add up.

Yes, I thought I’d already seen a discussion about it in this forum a few years ago :smiley:
I think I side with you in this case.
But on the other topic: do you (or any other forum members) have any good, quick hints for evaluating the required sample size for subsampling?
(i.e., some rules to decide whether I should use my smallest sample as the subsampling depth or just toss it altogether and take the next one.)

It won’t help for determining the best number to go with, but there was a paper from Rob Knight’s lab a few years ago in PNAS that looked at sequencing to a stupid depth (~3.1 million reads per sample). I haven’t read the paper in a while, but they concluded that 2,000 reads per sample is sufficient to give the same conclusions as the full data set. The link is http://www.pnas.org/content/early/2010/06/02/1000080107.
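
One rough way to test whether a number like that holds for your own data: repeatedly subsample your deepest sample to a range of depths and watch whether the estimate you actually care about (observed OTUs, Shannon, a distance to another sample) levels off. A hedged sketch with simulated counts:

```python
# Sketch: rarefy one deep sample to several depths and watch observed
# richness level off, as a crude check of whether ~2,000 reads is enough
# for your own data. The abundance distribution below is simulated.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical deep sample: 200 OTUs with a skewed (geometric) abundance pattern.
otu_counts = rng.geometric(p=0.01, size=200)
reads = np.repeat(np.arange(otu_counts.size), otu_counts)

for depth in (200, 500, 1000, 2000, 5000, reads.size):
    depth = min(depth, reads.size)
    drawn = rng.choice(reads, size=depth, replace=False)
    observed_otus = np.unique(drawn).size
    print(f"depth {depth:>6}: {observed_otus} OTUs observed")
```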

Only an anecdote, but one of the bioinformaticians at our institute has told us something similar: ~1.5–2k reads is enough, and after that you’re better off spending your sequences on replicates rather than depth.

Oh, that’s good to know. There should be no major problems with any of my datasets then.

> It won’t help for determining the best number to go with, but there was a paper from Rob Knight’s lab a few years ago in PNAS that looked at sequencing to a stupid depth (~3.1 million reads per sample). I haven’t read the paper in a while, but they concluded that 2,000 reads per sample is sufficient to give the same conclusions as the full data set. The link is http://www.pnas.org/content/early/2010/06/02/1000080107.
>
> Only an anecdote, but one of the bioinformaticians at our institute has told us something similar: ~1.5–2k reads is enough, and after that you’re better off spending your sequences on replicates rather than depth.

Well… it should be obvious by now that I like to be contrarian. I would say 2k is enough, but enough to do what? I think they also have a paper that says you need something like 200 reads to differentiate your feces from your mouth. If you’re going to make a more interesting comparison, or you want to do something other than beta diversity, then you might want more. Frankly, I think beta diversity is overplayed and an OTU-by-OTU analysis is far more interesting. But you need a lot of reads to get enough coverage of those rare bugs, so that you see things a decent number of times and can do stats on them.
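
To put a rough number on that last point: the chance of seeing an OTU at relative abundance p at least once in N reads is about 1 - (1 - p)^N, and for per-OTU stats you really want its expected count well above one. A quick back-of-the-envelope sketch (the abundances and depths are only illustrative):

```python
# Back-of-the-envelope: expected copies and detection probability for an OTU
# at relative abundance p, across a few sequencing depths. Plain binomial
# reasoning; the abundances and depths below are only illustrative.
def p_seen_at_least_once(p, n_reads):
    """Probability the OTU shows up at least once among n_reads reads."""
    return 1.0 - (1.0 - p) ** n_reads

for p in (0.01, 0.001, 0.0001):              # 1%, 0.1%, 0.01% relative abundance
    for n_reads in (2_000, 20_000, 200_000):
        expected_copies = p * n_reads
        print(f"p={p:.4f}, {n_reads:>7} reads: "
              f"expect {expected_copies:7.1f} copies, "
              f"P(seen) = {p_seen_at_least_once(p, n_reads):.3f}")
```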