Subsampling 3 different groups to the same level

I’m hoping to get an opinion on subsampling.
I have three sources of digesta that I’ve run separately. They were all subsampled to 11000 like in this example: sub.sample(, size=11000).

They could have been subsampled like this, based on the lowest reasonable number of sequences I have to deal with:
sub.sample(, size=11000)
sub.sample(, size=9898)
sub.sample(, size=8751)

My inclination was to subsample everything to one level (11000), even though I’m not currently comparing between digesta sources (mostly because my runs get killed on my HPCC when I try to run all samples together instead of as three groups).

I do lose some mice from the analysis when I subsample to 11000.

Should I subsample to 11000, 9898, and 8751 and not lose mice from the analysis? Or should I subsample all to 11000 to keep consistency?


I’m not 100% clear on your question, but 8571 sequences per mouse is a lot. If I had to choose between 8571 sequences with all of my samples and 11000 sequences with a few samples lost, I’d go with 8571 and use all of my samples.


Hi Pat,
Thanks for the reply. Let me clarify. I have three different digesta sources: small intestine, cecum, and colon. My lowest number of seqs in the colon is 11000. My lowest reasonable number of seqs in cecum is 8571, and lowest in small intestine is 6457. Out of 147 samples from the small intestine, I have about 10 samples with <5000 seqs. Initially, I subsampled all to 11,000 to keep things consistent and I was hoping to analyze all three digesta sources against each other… but 441 samples is a pretty big data set to run and my runs get killed by the HPCC administrator.

If I subsample to 11,000 seqs in the small intestine, I lose about 15 mice from analysis. I lose 1-2 mice from the analysis for cecum and colon (there are 1-2 mice from each group with <100 seqs).
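The trade-off described above (a deeper cutoff versus more mice kept) can be tallied before committing to a size. The following is only a rough sketch in Python, not a mothur command: the function name `samples_lost` and the per-sample counts are hypothetical and made up for illustration, and it assumes that any sample with fewer sequences than the chosen size is dropped from the analysis.

```python
def samples_lost(seq_counts, depth):
    """Return the names of samples that have fewer than `depth` sequences."""
    return sorted(name for name, n in seq_counts.items() if n < depth)


# Hypothetical per-sample sequence counts, for illustration only.
counts = {
    "mouse_a": 15200,
    "mouse_b": 11000,
    "mouse_c": 8571,
    "mouse_d": 6457,
    "mouse_e": 90,
}

# Compare how many samples each candidate cutoff would discard.
for depth in (11000, 8571, 6457):
    lost = samples_lost(counts, depth)
    print(f"size={depth}: lose {len(lost)} sample(s): {lost}")
```

Running the same tally on the real per-sample counts for each digesta source would show how many mice each candidate size costs.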

By your response, I think you understood my question?

I’d use the same number for all of them so you can compare them more easily. I’d also go below the lowest number you want to include because subsampling 11000 to 8700 is going to show more variability than subsampling 8700 to 8700.
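The point about variability can be sketched as follows. This is an illustrative Python example, not how mothur implements sub.sample: the `subsample` helper and the OTU counts are hypothetical, and it assumes subsampling means drawing reads at random without replacement. A sample that has exactly the target number of reads comes back unchanged on every draw, while a deeper sample gives slightly different counts each time.

```python
import random
from collections import Counter


def subsample(otu_counts, depth, seed=None):
    """Draw `depth` reads without replacement from one sample.

    Returns None when the sample has fewer than `depth` reads,
    i.e. the sample would be dropped at this cutoff.
    """
    if sum(otu_counts.values()) < depth:
        return None
    rng = random.Random(seed)
    # Expand the counts into one entry per read, then draw without replacement.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


# Hypothetical OTU counts, for illustration only.
exact = {"Otu001": 5000, "Otu002": 3000, "Otu003": 700}   # 8700 reads
deep = {"Otu001": 6000, "Otu002": 4000, "Otu003": 1000}   # 11000 reads

for seed in (1, 2, 3):
    print("8700 -> 8700: ", dict(subsample(exact, 8700, seed=seed)))
    print("11000 -> 8700:", dict(subsample(deep, 8700, seed=seed)))
```

Under these assumptions, a sample sitting exactly at the cutoff is the only one that is not randomly thinned, which is one reason to pick a size a little below the smallest sample you want to keep.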

What is your HPC admin complaining about? Processors? RAM? Wall time? Once you have OTUs, subsampling, even for beta diversity, shouldn’t take that much power, memory, or time.

Thanks kmitchell,

For the whole lot of 447 samples my wall time expires. When I run by digesta source, I usually can get a run completed in <4 days. I’ll have to try and subsample as you suggested.

It’s taking 4 days to subsample? I’d be surprised with even 2 days for hundreds of mouse samples (I run on 32 cores). Are you following the SOP, including the unique.seqs and pre.cluster steps? What is this data? MiSeq 2x250 or something else?