I am currently working with a V6 data set that was sequenced on an Illumina HiSeq; because of the large number of sequences we have for each sample, I am having difficulty analyzing the full data set in mothur. A colleague of mine suggested using a sub-sampling strategy to decrease the number of sequences for each sample. This approach seems to work well, and has eliminated the problems I was having in mothur. Of course, one major question I have about this strategy is at what depth sub-sampling will accurately represent the full data set (or at least come reasonably close). Does anyone have any suggestions as to which measurements to go by in order to compare the sub-sampled data to the full data set?
The Good’s coverage calculator seems like an obvious place to start, but I’m not sure what else to compare.
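To make the comparison concrete, here is a rough sketch (in Python, outside of mothur) of the kind of check I have in mind: compute Good's coverage, 1 - (number of singleton OTUs / total number of sequences), on the full per-sample OTU counts and again at several sub-sampling depths. The OTU names, counts, depths, and the function names `goods_coverage` and `subsample` are made up for illustration; this is not mothur's own implementation.

```python
import random
from collections import Counter


def goods_coverage(otu_counts):
    """Good's coverage: 1 - (singleton OTUs / total sequences)."""
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("sample contains no sequences")
    singletons = sum(1 for count in otu_counts.values() if count == 1)
    return 1.0 - singletons / total


def subsample(otu_counts, depth, seed=None):
    """Draw `depth` sequences without replacement and recount the OTUs."""
    # Expand the counts into one label per sequence, then sample from that pool.
    pool = [otu for otu, count in otu_counts.items() for _ in range(count)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of sequences in the sample")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


# Hypothetical OTU counts for a single sample (illustrative values only).
full = {"Otu1": 500, "Otu2": 300, "Otu3": 150, "Otu4": 40,
        "Otu5": 8, "Otu6": 1, "Otu7": 1}

if __name__ == "__main__":
    # Compare observed OTUs and coverage at each depth against the full sample.
    for depth in (50, 200, 800):
        sub = subsample(full, depth, seed=1)
        print(depth, len(sub), round(goods_coverage(sub), 4))
    print("full", len(full), round(goods_coverage(full), 4))
```

If the number of observed OTUs and the coverage level off well below the full depth, that would suggest the smaller depth captures most of what the full data set does. Repeating the draw with several seeds would show how much the values vary from one sub-sample to the next.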