Uneven sub.sample-ing


Some background: I am analyzing samples that vary greatly in sequence count (10,000 vs. 500,000+). I am subsampling down to the smallest sample size (10,000), and I noticed that the OTU distribution in my large samples after subsampling looks nothing like the original. Specifically, every OTU was reduced to a total abundance of <100, even though the most abundant OTUs in the original sample numbered in the 10,000s, so I would not expect them to be that low after subsampling. The subsampled samples were also much more even than the original samples.
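As a rough check on that expectation: random subsampling without replacement should keep each OTU at about its original proportion, so the expected count is the original count scaled by the subsampling fraction. A minimal Python sketch of that arithmetic follows. The function name and the example numbers (a 40,000-read OTU in a 500,000-read sample) are illustrative assumptions, not values from the actual samples.

```python
def expected_subsampled_count(otu_count, sample_total, depth):
    """Expected count of one OTU after drawing `depth` reads without replacement.

    This is the mean of a hypergeometric distribution:
    depth * otu_count / sample_total.
    """
    if not 0 <= depth <= sample_total:
        raise ValueError("depth must be between 0 and the sample total")
    return otu_count * depth / sample_total


# Hypothetical numbers: an OTU with 40,000 reads in a 500,000-read sample,
# subsampled down to 10,000 reads.
print(expected_subsampled_count(40_000, 500_000, 10_000))  # 800.0
```

On that basis, an OTU that starts in the 10,000s should still come out in the hundreds after subsampling from 500,000 to 10,000 reads, not below 100.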

To test this further, I ran sub.sample on a mock shared file, which can be found here (http://pastebin.com/W6zVGfHp), and ran this Python script (http://pastebin.com/TFKs1UyR) in an IPython notebook. What I found was that the subsampling is very uneven:

This graph shows 10 subsampling runs on the same shared file, with the subsamples at 10/10, 9/10, 8/10, etc. of the original shared file size. The OTUs are arranged along the x axis by their initial abundance (1, 10, 100, 1,000, 10,000, 100,000), with their abundance in the subsampled shared file on the y axis (log10), and are color coded at each initial abundance by replicate OTU number (the shared file contained one sample with 24 OTUs: 6 OTUs with staggered abundances, repeated 4 times).

What is apparent is that some OTUs remain virtually untouched, even when the shared file is subsampled to 1/10th of its original size, while others have essentially been eradicated.

This subsampling was performed using mothur 1.39.3.



Could you post the command you ran in mothur?

Hi Sarah,

Here is a pastebin of my logfile. The sub.sample command is run identically each time, on the original shared file, just subsampling down to different sizes.


Just wondering if this is being looked into at all?

For comparison, this is what I see if I do the exact same thing but use another software package to do the subsampling (the subsample_counts function in the Python package scikit-bio).

As you can see, the abundance distribution is maintained across the subsamples, whereas it does not appear to be when using mothur's sub.sample.
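To make "maintained" concrete, here is a minimal standard-library sketch of subsampling without replacement on a count vector shaped like the mock sample described above (6 staggered abundances repeated 4 times). It is a stand-in for illustration only: it is neither scikit-bio's implementation nor mothur's, and the helper name, the seed, and the order of the OTUs in the vector are assumptions.

```python
import random
from collections import Counter


def subsample_without_replacement(counts, depth, seed=None):
    """Draw `depth` reads without replacement from a vector of OTU counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    # Expand the count vector into one entry per read, labelled by OTU index.
    reads = [otu for otu, count in enumerate(counts) for _ in range(count)]
    # random.Random.sample draws without replacement.
    tally = Counter(rng.sample(reads, depth))
    return [tally.get(otu, 0) for otu in range(len(counts))]


# Mock sample: 6 staggered abundances repeated 4 times (24 OTUs in total).
levels = [1, 10, 100, 1_000, 10_000, 100_000]
counts = levels * 4

# Subsample to 1/10th of the original size.
depth = sum(counts) // 10
subsampled = subsample_without_replacement(counts, depth, seed=1)

# Under an unbiased subsample, the retained fraction should be close to 0.1
# for every abundant OTU, whichever replicate it belongs to.
for original, kept in zip(counts, subsampled):
    if original >= 1_000:
        print(original, kept, round(kept / original, 3))
```

With an unbiased sampler, the four replicates at each abundance level end up with similar counts, which is the pattern described for the scikit-bio result and the opposite of some OTUs staying untouched while others vanish.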

Thanks for your help in finding and resolving this issue. In mothur we relied on the standard C++ function rand() to randomize. The implementation of this function is not defined by the C++ standard and varies slightly on the Windows platform, which caused the discrepancy. I have updated mothur to use the seeded mersenne_twister_engine to randomize, which resolves this issue. Here is a link to the latest release: https://github.com/mothur/mothur/releases/tag/v1.39.4.
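For readers who want to see the idea of a seeded Mersenne Twister outside of C++, here is a small Python illustration. It is not mothur's code: it relies only on the fact that Python's random.Random is itself a Mersenne Twister generator, and the helper name and seed values are arbitrary assumptions.

```python
import random


def shuffled_read_order(n_reads, seed):
    """Return read indices 0..n_reads-1 in an order fixed by `seed`."""
    rng = random.Random(seed)  # Python's Random is a Mersenne Twister (MT19937)
    order = list(range(n_reads))
    rng.shuffle(order)
    return order


first = shuffled_read_order(1_000, seed=42)
second = shuffled_read_order(1_000, seed=42)
other = shuffled_read_order(1_000, seed=7)

print(first == second)  # same seed, same order
print(first == other)   # different seed, different order
```

Taking the first N indices of such a shuffled order is one simple way to subsample without replacement, and fixing the seed makes the result repeatable from run to run.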

Hi Sarah,

Thanks for fixing that. The sub.sampling appears to be working as expected on my system now in 1.39.4.

Kind Regards