So, some background. I am analyzing samples that vary greatly in number of sequences (10,000 vs 500,000+). I am subsampling down to the smallest sample size (10,000), and what I noticed is that the distribution of OTUs in my large samples after subsampling looks nothing like the original samples. Specifically, every OTU was brought down to a total abundance of <100, even though in the original sample the most abundant OTUs numbered in the 10,000s, so I wouldn't expect them to be so low after subsampling. The subsampled samples were also much, much more even than the original samples.
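To put a rough number on that expectation: if reads are drawn uniformly at random without replacement, an OTU's expected count after subsampling is just its original count scaled by the subsampling fraction. Here is a minimal sketch of that arithmetic. The function name `expected_count` and the 20,000-read OTU are illustrative assumptions, not values taken from my data.

```python
def expected_count(otu_reads, total_reads, depth):
    """Expected reads for one OTU after drawing `depth` reads uniformly
    at random, without replacement, from `total_reads` reads.
    This is the mean of a hypergeometric distribution."""
    return otu_reads * depth / total_reads


# Hypothetical example: an OTU with 20,000 reads in a 500,000-read sample,
# subsampled down to 10,000 reads.
print(expected_count(20_000, 500_000, 10_000))  # 400.0
```

Under that assumption, an OTU in the 10,000s should still land in the hundreds after subsampling to 10,000 reads, not below 100.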
To test this further, I ran sub.sample on a mock shared file, which can be found here (http://pastebin.com/W6zVGfHp), and ran this (http://pastebin.com/TFKs1UyR) Python script (in an IPython notebook). What I found was that the subsampling is very uneven:
This graph shows 10 subsampling runs on the same shared file, where the subsamples are 10/10, 9/10, 8/10, etc. of the original shared file size. The OTUs are arranged along the x axis by their initial abundance (1, 10, 100, 1,000, 10,000, 100,000), with their abundance in the subsampled shared file on the y axis (log10). At each initial abundance, the points are color coded by replicate OTU number (the shared file contained one sample with 24 OTUs, which were really 6 OTUs with staggered abundances, each repeated 4 times).
What is apparent is that some OTUs remain virtually untouched, even when the shared file is subsampled to 1/10th of the original size, while others have essentially been eradicated.
This subsampling was performed using mothur 1.39.3.
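For comparison, here is a minimal sketch of what I would expect from the same mock community if every read were equally likely to be drawn, without replacement. This is my assumption about what subsampling should do, not a description of how mothur implements sub.sample. The OTU labels and function names are made up for illustration, and it uses only the Python standard library.

```python
import random
from collections import Counter

# Mock community as described above: 6 staggered abundances, each repeated 4 times.
ABUNDANCES = [1, 10, 100, 1_000, 10_000, 100_000]
REPLICATES = 4


def build_mock_counts():
    """Return {otu_label: abundance} for the 24-OTU mock sample (labels are hypothetical)."""
    counts = {}
    for abundance in ABUNDANCES:
        for rep in range(1, REPLICATES + 1):
            counts[f"Otu_{abundance}_rep{rep}"] = abundance
    return counts


def subsample(counts, size, rng):
    """Draw `size` reads without replacement, with every read equally likely."""
    # Expand the count table into one list entry per read.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    drawn = Counter(rng.sample(reads, size))
    # Report every OTU, including those that dropped to zero.
    return {otu: drawn.get(otu, 0) for otu in counts}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    mock = build_mock_counts()
    total = sum(mock.values())
    for tenths in range(10, 0, -1):
        size = total * tenths // 10
        sub = subsample(mock, size, rng)
        top = [sub[f"Otu_100000_rep{r}"] for r in range(1, REPLICATES + 1)]
        print(f"{tenths}/10 of reads -> 100,000-read OTUs: {top}")
```

With this kind of draw, the four replicate OTUs at each initial abundance should shrink roughly in proportion to the subsample size and stay close to one another, which is not what the graph shows.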