tips on subsampling, feature request?

The SOP suggests sub-sampling to normalize the number of sequences between samples.
Depending on the samples, the number of sequences can vary wildly (all the more reason to normalize, of course), but do you have any suggestions on the level at which I should discard the smallest samples? I understand there should be no problem whether the smallest sample is 10 or 100x smaller than the largest, as long as that small number of sequences subsampled from the largest one is still a representative sample. Are there any good, quick ways to estimate that? Maybe there could even be a built-in feature in mothur to estimate whether the smallest sample is large enough to be representative of the others, so users can decide whether to subsample to that number or throw the sample out altogether.
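
For reference, the kind of quick check I have in mind is something like Good's coverage (1 minus the fraction of reads belonging to singleton OTUs) per sample; if I remember right, mothur's summary.single already reports this via its coverage calculator. A rough sketch of the same idea outside mothur, with made-up counts, just to show what I mean:

```python
# Rough sketch (not an existing mothur feature): Good's coverage per sample,
# computed from a hypothetical OTU count table, as one quick way to judge
# whether the smallest sample still looks representative.
def goods_coverage(otu_counts):
    """Good's coverage = 1 - (number of singleton OTUs / total reads)."""
    total_reads = sum(otu_counts)
    if total_reads == 0:
        return 0.0
    singletons = sum(1 for count in otu_counts if count == 1)
    return 1.0 - singletons / total_reads

# Made-up per-OTU read counts for two samples of very different depth.
counts = {
    "deep_sample":    [520, 310, 120, 44, 9, 3, 1, 1],
    "shallow_sample": [38, 21, 6, 2, 1, 1, 1],
}

for name, otu_counts in counts.items():
    print(f"{name}: {sum(otu_counts)} reads, "
          f"Good's coverage = {goods_coverage(otu_counts):.3f}")
```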

On the other hand, Cárcer, Denman, McSweeney, & Morrison, 2011 (http://www.ncbi.nlm.nih.gov/pubmed/21984239) have suggested that sub-sampling to the median might be an improvement over the subsample-to-minimum strategy. Any ideas about that, or about how such subsampling might feasibly be done in mothur?
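
For context on the mechanics: subsampling to any fixed depth is just drawing reads without replacement, and the wrinkle with a median target is that samples below the median cannot be subsampled that way. A minimal sketch of the plain subsample-to-minimum strategy (counts are made up; numpy only does the drawing):

```python
# Minimal sketch of rarefying every sample down to the smallest library size
# by drawing reads without replacement. Counts are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the draw is reproducible

table = {
    "big_sample":   [500, 300, 120, 60, 15, 5],
    "small_sample": [40, 25, 10, 4, 1],
}

depth = min(sum(counts) for counts in table.values())  # subsample-to-minimum

rarefied = {}
for name, otu_counts in table.items():
    # Expand counts into one OTU label per read, draw `depth` reads, re-tally.
    reads = np.repeat(np.arange(len(otu_counts)), otu_counts)
    drawn = rng.choice(reads, size=depth, replace=False)
    rarefied[name] = np.bincount(drawn, minlength=len(otu_counts))

for name, new_counts in rarefied.items():
    print(name, new_counts.tolist(), "->", int(new_counts.sum()), "reads")
```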

Sorry, but subsampling to the median makes zero sense to me: for half the samples you’d be making up data. I might be missing something, but it just doesn’t add up.

Yes, I thought I’d already seen a discussion about it in this forum a few years ago :smiley:
I think I side with you in this case.
But on the other topic: do you (or any other forum members) have any good, quick hints for evaluating the required sample size for subsampling?
(i.e., some rules to decide whether I should use my smallest sample as the subsampling depth or just toss it altogether and take the next one.)

It won’t help for determining the best number to go with, but there was a paper from Rob Knight’s lab a few years ago in PNAS that looked at sequencing to a stupid depth (~3.1 million reads per sample). I haven’t read the paper in a while, but they concluded that 2,000 reads per sample is sufficient to give the same conclusions as the full data set. The link is http://www.pnas.org/content/early/2010/06/02/1000080107.
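
One rough way to test whether a number like that holds for your own data: repeatedly subsample your deepest sample to a range of depths and watch whether the estimate you actually care about (observed OTUs, Shannon, a distance to another sample) levels off. A hedged sketch with simulated counts:

```python
# Sketch: rarefy one deep sample to several depths and watch observed
# richness level off, as a crude check of whether ~2,000 reads is enough
# for your own data. The abundance distribution below is simulated.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical deep sample: 200 OTUs with a skewed (geometric) abundance pattern.
otu_counts = rng.geometric(p=0.01, size=200)
reads = np.repeat(np.arange(otu_counts.size), otu_counts)

for depth in (200, 500, 1000, 2000, 5000, reads.size):
    depth = min(depth, reads.size)
    drawn = rng.choice(reads, size=depth, replace=False)
    observed_otus = np.unique(drawn).size
    print(f"depth {depth:>6}: {observed_otus} OTUs observed")
```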

Only an anecdote, but one of the bioinformaticians at our institute has told us something similar: ~1.5–2k reads is enough, and after that you’re better off spending your sequences on replicates rather than depth.

Oh, that’s good to know. There should be no major problems with any of my datasets then.

> It won’t help for determining the best number to go with, but there was a paper from Rob Knight’s lab a few years ago in PNAS that looked at sequencing to a stupid depth (~3.1 million reads per sample). I haven’t read the paper in a while, but they concluded that 2,000 reads per sample is sufficient to give the same conclusions as the full data set. The link is http://www.pnas.org/content/early/2010/06/02/1000080107.
>
> Only an anecdote, but one of the bioinformaticians at our institute has told us something similar: ~1.5–2k reads is enough, and after that you’re better off spending your sequences on replicates rather than depth.

Well… it should be obvious by now that I like to be contrarian. I would say 2k is enough, but enough to do what? I think they also have a paper that says you need something like 200 reads to differentiate your feces from your mouth. If you’re going to make a more interesting comparison, or you want to do something other than beta diversity, then you might want more. Frankly, I think beta diversity is overplayed and an OTU-by-OTU analysis is far more interesting. But you need a lot of reads to get enough coverage of those rare bugs, so that you see things a decent number of times and can do stats on them.
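
To put a rough number on that last point: the chance of seeing an OTU at relative abundance p at least once in N reads is about 1 - (1 - p)^N, and for per-OTU stats you really want its expected count well above one. A quick back-of-the-envelope sketch (the abundances and depths are only illustrative):

```python
# Back-of-the-envelope: expected copies and detection probability for an OTU
# at relative abundance p, across a few sequencing depths. Plain binomial
# reasoning; the abundances and depths below are only illustrative.
def p_seen_at_least_once(p, n_reads):
    """Probability the OTU shows up at least once among n_reads reads."""
    return 1.0 - (1.0 - p) ** n_reads

for p in (0.01, 0.001, 0.0001):              # 1%, 0.1%, 0.01% relative abundance
    for n_reads in (2_000, 20_000, 200_000):
        expected_copies = p * n_reads
        print(f"p={p:.4f}, {n_reads:>7} reads: "
              f"expect {expected_copies:7.1f} copies, "
              f"P(seen) = {p_seen_at_least_once(p, n_reads):.3f}")
```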