Hello everyone,

I am currently working with a V6 data set that was sequenced on an Illumina HiSeq. Because of the large number of sequences we have for each sample, I am having difficulty analyzing the full data set in mothur. A colleague of mine suggested using a sub-sampling strategy to decrease the number of sequences for each sample. This approach seems to work well, and has eliminated the problems I was having in mothur. Of course, one major question I have about this strategy is at what depth sub-sampling will accurately represent the full data set (or at least come reasonably close). Does anyone have suggestions as to which measurements to go by when comparing the sub-sampled data to the full data set?
The Good’s coverage calculator seems like an obvious place to start, but I’m not sure what else to compare.
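
To make the comparison concrete, here is a minimal Python sketch (not mothur code) of what I have in mind: compute Good's coverage, taken here as 1 - (number of singleton OTUs / total number of sequences), on the full sample and again on random sub-samples at a few depths. The function names, the OTU labels, and the toy counts are all made up for illustration, and it assumes each sample is already summarized as per-OTU sequence counts.

```python
import random
from collections import Counter


def goods_coverage(otu_counts):
    """Good's coverage: 1 - (singleton OTUs / total sequences)."""
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("sample has no sequences")
    singletons = sum(1 for count in otu_counts.values() if count == 1)
    return 1.0 - singletons / total


def subsample(otu_counts, depth, seed=None):
    """Draw `depth` sequences at random, without replacement."""
    # Expand the counts into one entry per sequence. Fine for a sketch,
    # but memory-hungry for a real HiSeq-sized sample.
    pool = [otu for otu, count in otu_counts.items() for _ in range(count)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of sequences in the sample")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


# Hypothetical toy sample: OTU label -> number of sequences.
full_sample = Counter({"otu_a": 500, "otu_b": 300, "otu_c": 150,
                       "otu_d": 40, "otu_e": 8, "otu_f": 1, "otu_g": 1})

print("full:", goods_coverage(full_sample))
for depth in (50, 200, 800):
    sub = subsample(full_sample, depth, seed=1)
    print(depth, goods_coverage(sub), "OTUs observed:", len(sub))
```

Coverage on its own mostly tells you how many singletons remain, so printing the number of OTUs observed at each depth alongside it gives a second, simple point of comparison with the full data set.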



Jeremy, in case you are still searching for an answer, this could help:

Aguirre de Cárcer D, Denman SE, McSweeney C, Morrison M. Evaluation of
subsampling-based normalization strategies for tagged high-throughput sequencing
datasets from gut microbiomes. Appl Environ Microbiol. 2011 Oct 7.