uneven Sub.sample-ing

campenr · February 28, 2017, 4:24pm

Hello,

So some background. I am performing analysis on some samples that vary greatly in sequence number (10,000 vs 500,000+). I am subsampling down to the smallest sample size (10,000), and what I noticed is that the distribution of the OTUs in my large samples after subsampling looks nothing like the original samples. Specifically, all OTUs were brought to a total abundance of <100 even though the most abundant OTUs in the original sample numbered in the 10,000s, so I wouldn’t expect them to be so low after subsampling. The subsampled samples were also much more even than the original samples.

To test this some more, I ran sub.sample on a mock shared file, which can be found here (http://pastebin.com/W6zVGfHp), and ran this (http://pastebin.com/TFKs1UyR) Python script (in an IPython notebook). What I found was that the subsampling is very uneven:

This graph shows 10 subsampling runs on the same shared file, where the subsamples are at 10/10, 9/10, 8/10, etc. of the original shared file size. The OTUs are arranged along the x axis by their initial abundance (1, 10, 100, 1,000, 10,000, 100,000), with their abundance in the subsampled shared file on the y axis (log10), and are color coded at each initial abundance by replicate OTU number (the shared file contained one sample with 24 OTUs: 6 OTUs with staggered abundances, each repeated 4 times).

What is apparent is that some OTUs remain virtually untouched, even when the shared file is subsampled to 1/10th of the original size, while others have essentially been eradicated.

This subsampling was performed using mothur 1.39.3.

Thoughts?

Richard

westcott · February 28, 2017, 7:57pm

Could you post the command you ran in mothur?

campenr · February 28, 2017, 8:16pm

Hi Sarah,

Here is a pastebin of my logfile. The sub.sample command is run identically each time, on the original shared file, just subsampling down to different sizes.

Cheers
Richard

campenr · March 2, 2017, 2:18pm

Just wondering if this is being looked into at all?

campenr · March 2, 2017, 7:53pm

For comparison, this is what I see if I do the exact same thing but use another software package to do the subsampling (the subsample_counts function in the Python package scikit-bio).

As you can see, the abundance distribution is maintained across the subsamples, whereas it does not appear to be when using mothur’s sub.sample.
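For anyone who wants to check what "maintained" should look like without installing anything, here is a minimal sketch using only the Python standard library. It is not the pastebin script and not scikit-bio's implementation. The function name `subsample_without_replacement` is made up for illustration, and the mock community is rebuilt from the description earlier in the thread (6 staggered abundances, each repeated 4 times). With random subsampling without replacement, each OTU should end up at roughly its original count multiplied by the subsampling fraction.

```python
import random

# Mock community as described in the thread: 6 staggered abundances,
# each repeated 4 times, giving 24 OTUs and 444,444 reads in one sample.
ABUNDANCES = [1, 10, 100, 1_000, 10_000, 100_000]
counts = [a for a in ABUNDANCES for _ in range(4)]


def subsample_without_replacement(counts, n, seed=0):
    """Draw n reads without replacement and return the per-OTU counts.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Expand the count vector into one OTU index per read.
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    picked = rng.sample(reads, n)
    result = [0] * len(counts)
    for otu in picked:
        result[otu] += 1
    return result


total = sum(counts)
# Subsample to 1/10th of the original size, as in the smallest run above.
tenth = subsample_without_replacement(counts, total // 10)
for original, kept in zip(counts, tenth):
    print(original, kept)
```

In this sketch the four OTUs that start at 100,000 should each land near 10,000 after subsampling to 1/10th, and the four replicates at each abundance level should stay close to one another. That is the pattern in the scikit-bio plot and the opposite of what the mothur 1.39.3 plot shows.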

westcott · March 6, 2017, 9:55pm

Thanks for your help in finding and resolving this issue. In mothur we relied on the standard C++ function rand() to randomize. The implementation of this function is not defined by the C++ standard and varies slightly on the Windows platform, which caused the discrepancy. I have updated mothur to use the seeded mersenne_twister_engine to randomize, which resolves this issue. Here is a link to the latest release: https://github.com/mothur/mothur/releases/tag/v1.39.4.
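As a rough illustration of the idea behind the fix (not mothur's actual C++ code), the sketch below shuffles reads with a seeded generator and keeps the first n. Python's `random` module happens to use the Mersenne Twister as well, so the same seed gives the same subsample on every platform. The function name `shuffled_subsample`, the OTU labels, and the read counts are all made up for the example. For context, one well-known platform difference in `rand()` is that `RAND_MAX` is 32767 in Microsoft's C runtime; whether that is what caused the behaviour reported here is not stated in this thread.

```python
import random


def shuffled_subsample(reads, n, seed):
    """Shuffle a copy of the reads with a seeded generator and keep the first n.

    Hypothetical helper; random.Random is a Mersenne Twister generator.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return shuffled[:n]


# Toy sample: 900 reads of one OTU and 100 of another (made-up numbers).
reads = ["OTU1"] * 900 + ["OTU2"] * 100

first = shuffled_subsample(reads, 100, seed=42)
second = shuffled_subsample(reads, 100, seed=42)

print(first == second)  # same seed, same subsample
print(first.count("OTU1"), first.count("OTU2"))
```

The second line of output should stay close to the original 9:1 ratio, since every read has the same chance of ending up in the first n positions of the shuffle.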

campenr · March 7, 2017, 2:16pm

Hi Sarah,

Thanks for fixing that. The sub.sampling appears to be working as expected on my system now in 1.39.4.

Kind Regards
Richard
