What's the rationale for sub.sample() after making OTUs

I was wondering why in the Schloss SOP the subsampling step:

sub.sample(shared=final.an.shared, size=4405)

apears after making the OTUs (dist.seqs(), cluster(), make.shared()).

Wouldn’t it make more sense to first subsample the groups and only then generate OTUs?

Thanks in advance

I’m assuming you’re clear that there is a need to sub-sample, it’s just a matter of when. I prefer doing it after make.shared because there is a random aspect to this that would probably require subsampling 1000 times. If you do this before cluster, you’d have to cluster 1000 times (ugh).

Exactly Pat; no argument about why, just about when.
The problem I thought of is that the OTUs might be different if calculated after sub-sampling (not just in their size of course but also in which OTUs are generated and which sequences they include).
An additional advantage is that dist.seqs.() and cluster() would run a lot faster.
I understand that sub-sampling is random what I don’t really understand is where’s the 1000 iteration step.
Why not simply create a random subset of sequences using sub.sample() and then use those to create a distance matrix and OTUs?

Thanks again

the 1000 iteration step would be in the rarefaction.single command and, well, in other commands (this is on the way…). :slight_smile:

Ah… i found this before I asked…

I’m concerned about when to sub.sample also. If the rationale is because the sequencer produces uneven distribution of sequences, then I would assume you would want to sub.sample before classification and OTU clustering? Wouldnt you want to generate an equal bunch of ‘good’ sequences before doing any of the “biological steps” that are not really related to the sequence number output??

If I make an analogy to a ‘grab sample’ of the marine benthos (dont take it too literally, the systems are different), I would prefer to weigh out an equal amount of sediment before counting the infauna rather than count the infauna then sub.sample based on the numbers . The grab = raw sequences, equaling the sediment = sub.sample, counting = classify + OTU… hope this makes sense

We do it after clustering from a practical perspective. If you want to run cluster 1000 times, go for it :slight_smile: I doubt the results will change meaningfully if you do 1000 clusterings or 1000 samplings from the shared file. The classification shouldn’t matter since each classificaiton for each sequence is independent of the other sequences.


I dont understand this 1000 times business (As weith Roey)? Why not just sub-sample once from a sample? I dont agree with this multiple ‘samplings’. It should just be one random sample from a parent sample? You get what you get. This is more of an experimental design/sampling issue.
Is there a misunderstanding here? How does sub.sample work? just one sub.sample? or it does it multiple times?


Sub.sample only does it once. But if you do summary.single, summary.shared, dist.shared, etc, you can use subsample=whatevernumberofsequences. And it will take 1000 random draws and give you an average.

We digress…

I believe, then, that sub.sampling should then be done at the last sequence quality/filtering step before OTU clustering. The subsample should equalise the number of sequences from the machine for each sample.

It is a much of a muchness… but i think i’ll do it this way

thanks Pat for the responses,


There is an ongoing debate in our lab about the issue of subsampling. I found it hard to find relevant literature on this in the case of pyrosequencing, although there is plenty of literature regarding general sampling in ecology.

My rationale was an analogy from hydrobiology (almost the same as the benthos example!). You can take a sample by filtering much water which you think will be enough to get even the rare species and then you count the first say 300 specimens of your target group. In the case of sequences resulting from pyrosequencing the analogous approach is: I filter them, create OTU-s, remove contaminants (can only be done after creating the OTUs!) and then randomly subsample once to the lowest meaningful number of sequences among samples.

If you think about the mechanism of pyrosequencing, this approach also makes sense, because it mimics the way the sequenced amplicons are “subsampled” from original amplicon pool by binding to the given number of beads.

I guess the sub.sample command is also without replacement, so it can be used for this approach, or?

Of course in hydrobiology another approach by using the same amount of water and then counting all the specimens in them also exists, in the case you are more interested in the quantities of the common species. Analogously regarding pyrosequencing my colleagues use subsampling with replacement, resampling to the median, and iterated approaches when at the end they use a mean value for abundance.

Any thoughts (papers/literature) on this? Any more reasons why any of these approaches makes more sense than the others?

Part of the problem is that we aren’t exactly sampling everything evenly - the number of reads is not correlated with the amount of starting DNA. In the bad old days we’d sequence 96 clones per sample and we wouldn’t worry about these things. Now, we can have one sample with 1000 reads and another with 10000 reads because of how the pipetting was done during normalization.

When we do things like alpha or beta diversity, we actually subsample 1000 times (each time without replacement) and each time we calculate the desired parameter. Then we calculate the average parameter that we use for reporting the data, ordination, amova/homova, etc. The only problem comes in when you’re interested in doing something like otu-by-otu anova, metastats, random forest because then you can only do a single subsampling. You could do the anova’s 1000 times and make sure you get the same result, but… I suspect that in these latter types of analysis, you are really interested in the OTUs that you’ve seen a decent number of times and their relative frequency is not likely to change much from subsampling to subsampling.

Hopefully people can agree that subsampling is important (how to compare 1000 to 10000 reads?), the debate seems to be over these latter cases of otu-by-otu analysis…

I’ve been trying to think of a reasonable way to do the otu based community comparisons-ie. indicator analysis or random forest-that incorporates subsampling. The best that I’ve come up with, theoretically, is to subsample then average the # seqs per otu. So if something is a singleton and only selected once in the 1000 subsamplings it would have an abundance of .001, which is sort of ridiculous but seems better than a single subsampling where it could get an abundance of 1. I haven’t figured out how to do that sort of repeated subsampling in R, so now am just subsampling a few times, running indicator analysis on those matrixes and seeing which OTUs are consistently showing up.

We’ve thought of that too, but that is the same thing as getting the relative abundance, but will cost you a few more carbon credits computationally speaking. The problem with that as I see it is that in the 1000 vs 10000 case the 10000 community will detect many more OTUs and skew the relative abundances since the 1000 community will never have those OTUs show up.

The probability of occurrence of an OTU in a sub-sampling will be tied both to its original relative abundance and to the size of the sub-sampled community. So I think that one potential solution is to sub-sample n times (without replacement) and keep those OTUs that are selected, say, more than 50% of the times…