Rarefaction or sub-sampling?

Dear mothur enthusiasts,

I have a question arising after rejection of a manuscript. The editor asked us whether “the samples were rarefied prior to downstream analysis”.
I know about rarefaction curves and the help they provide in estimating the coverage of the total diversity, but I have the impression that this is not what he/she means by “rarefy the samples”. Could this mean “sub-sampling”, so that all samples have the same number of reads (i.e. adjusting to the sample with the lowest number)? Is it correct to use the command sub.sample in that case (with the option persample=T), or is there another command that is especially for rarefying?

And at what step should this be done? I would do this after all quality-control steps (assembly of contigs, search for chimeras, but also removal of Plasts, Mitochondria…), but before calculation of the distances and clustering.

Thank you for your advice,

sub-sample can mean selecting some samples, while rarefaction means select equal number of observations per sample. In mothur sub-sample and rarefaction mean the same thing.

If you followed the SOP, you would have subsampled when calculating alpha and beta diversity (this is subsampling many times). Additionally, if you are making things like heatmaps, make sure that you subsample a shared (this is subsampling once).

Dear Kendra,

Thank you for your message. I had initially not calculated alpha or beta diversity, but now I think I will. What are the pros and cons of either using the rarefaction option of the specific commands such as phylo.diversity(), or instead of sub-sampling once for good? Of course, sub-sampling in this case should be done only after all steps that are likely to remove sequences from the dataset (because otherwise the numbers of sequences per sample would change again)?


subsampling repeatedly is good for diversity measures because you never know when you’re going to get a weird subsample (it’s a random process). I never do the phylotype analyses in mothur so can’t help with those.

There’s a subtle, but very important, difference between subsampling and rarefaction. Subsampling will draw a desired number of sequences from each sample - once. This is what we do for things like metastats or machine learning models. For alpha and beta diversity, we rarefy the data. This is subsampling 1000 times and averaging the 1000 subsamples. With the large number of randomizations, you average out the effects of getting lucky/unlucky. This approach is demonstrated in the MiSeq SOP wiki page.



I am very happy that the forum is back online because I am still confused…

I understand rarefaction as a way to test whether the bacterial groups that were evidenced, using the total number of sequences obtained, would also have been evidenced with less sequences (in other words, have we seen everything that is present or should we have sequenced more). This is the rarefaction curve, that can reach a plateau or not. And this greatly depends on the chosen taxonomic level (I did a quick run and at 97% I did not get saturation whereas my colleague did it at 95% and saturated, which would mean that we have seized all genera but not all species).

But I don’t understand what the editor suggests as “rarefying the data prior to downstream analysis”, because in our manuscript we did not reason in terms of presence/absence but in abundance %. Indeed, to rarefy, one calculates the average of a high number of randomizations (typically, 1000, or more). I can’t imagine that the abundance % of the groups would be any different after rarefaction from what we get from the entire dataset (because it is averaged and lucky/unlucky draws all make up for each other, converging to average values equal to the overall abundance %).

Furthermore, in our analysis, we produced a tree and then the UniFrac matrix (unweighted version) that was later used in the multivariate statistical analyses. But to date I can’t figure out, how to obtain a “rarefied dataset” from which we could produce a tree, then the UniFrac matrix and from that re-do our statistical analyses… Initially I had thought of extracting a reduced sequence set in order to have the same count for all samples (command sub.sample), but as Pat pointed out, this is not rarefaction but only a random draw. Should I use the subsample=T option of unifrac.unweighted to that purpose? The “reduced tree” will not be output but produced internally at each randomization (and I don’t really need it to be saved).


As a follow-up:

I have run the command:

unifrac.unweighted(tree=bf_tree, group=bfr.contigs.pcr.good.pick.groups, distance=square, processors=4, random=F, subsample=T, iters=1000)

And I got those files:
bf_tree1.unweighted.phylip.dist -> seems to be identical to the matrix obtained without subsampling option
bf_tree1.unweighted.ave.dist -> averaged matrix?
bf_tree1.unweighted.std.dist -> standard deviation?


For the unifrac distance matrix you want the ave.dist file, which is the average of the iterations.


Hi Pat,

Thank you for the confirmation. I am making slow progress towards integrating rarefaction (or sort of) into our workflow, which included NMDS based on the UniFrac matrix, but also CCA based on Bray-Curtis distances.

I was surprised, however, that the results of count.groups(group=xxx.group) are from 8963 to 164867 and the subsample size for unifrac.unweighted was set to 2837. What does this correspond to?


I am very intrigued by the output of unifrac.unweighted with the subsample option. When I do


I get

S13 contains 8963. [i](lowest number)[/i]
S44 contains 164867. [i](highest number)[/i]

But then I do

unifrac.unweighted(tree=bf_tree, group=bfr.groups, distance=square, processors=4, random=F, subsample=T, iters=1)

and I get

Setting subsample size to 2837.

This is coherent with the fact that if I set subsample=2500 then everything runs fine and if I set subsample=3000 then I get the message “You have selected a size that is larger than S13 number of sequences, removing S13.”
Is it possible that some information in the tree indicates lower numbers of sequences than in the group file? Is there an easy way to count how many sequences of each group are present in the tree? Is there a way to know how many sequences the unifrac.unweighted “sees” in each sample?

Have you tried count.groups?

Yes, I have used count.groups on the same group file as the one used for unifrac.unweighted. For the sample with the lowest count, count.groups outputs 8963 whereas unifrac.unweighted subsamples everybody down to 2837 sequences. Or perhaps you mean use count.groups on the tree, is this possible (according to the wiki, this doesn’t seem to be…)?
PS: the tree was not produced by clearcut (I did not manage to use it) but by FastTree. Could this be the cause of the problem?

Yeah, that’s possible. Have you tried using clearcut?

I don’t remember why exactly, but I did not manage to produce my tree with clearcut (I think it just crashed).

I tried again to make the tree with clearcut (based on the aligned fasta file), to no success:

mothur > clearcut(fasta=bf7.fasta, DNA=T, verbose=T)
PRNG SEED: 1479731242
Clearcut: Memory allocation error in NJ_compute_dmat()
Clearcut: Error computing distance matrix
Clearcut: Failed to build distance matrix from alignment.

Unfortunately, no information is to be found in the mothur logfile. The program runs for 10-20 minutes, using 100% of only one CPU core and RAM is far from being saturated (although my PC only has 32GB).

I suspect the dataset is too big, but you might try following the example in the MiSeq SOP where the distance calculation is done from within mothur.


Pat, I don’t understand your message. First I tried to do everything from within mothur, but I failed to get the tree from the mothur-generated alignment (using clearcut). That’s why I turned to FastTree just for that step (back to mothur afterwards). But then I got the discrepancy between the number of sequences to which unifrac.unweighted downsamples everything and the count.groups estimate.

The error message suggests that you did not run dist.seqs as described in the MiSeq SOP:


As I mentioned, your dataset is probably too large to build a tree and run unifrac. I’d encourage you to use OTU-based methods instead.


Hi, Sorry for reopening this old thread but I am a beginner to microbiome analysis and there is an
aspect of rarefying that I am still confused on. Apologies for the very basic question!

Would I be correct in saying that our subsampled file is the file that is being rarefied? Or is the “raw” unsubsampled file that is being rarefied?
And then the rarified file is being used for all alpha and beta diversity analysis?

Thanks in advance!

it’s the raw file. the output from sub.sample will have the same number of reads in each sample as does the rarefied file. the difference is that sub.sample does one subsampling while rarefaction does 1000.