I am very happy that the forum is back online because I am still confused…
I understand rarefaction as a way to test whether the bacterial groups that were evidenced, using the total number of sequences obtained, would also have been evidenced with less sequences (in other words, have we seen everything that is present or should we have sequenced more). This is the rarefaction curve, that can reach a plateau or not. And this greatly depends on the chosen taxonomic level (I did a quick run and at 97% I did not get saturation whereas my colleague did it at 95% and saturated, which would mean that we have seized all genera but not all species).
But I don’t understand what the editor suggests as “rarefying the data prior to downstream analysis”, because in our manuscript we did not reason in terms of presence/absence but in abundance %. Indeed, to rarefy, one calculates the average of a high number of randomizations (typically, 1000, or more). I can’t imagine that the abundance % of the groups would be any different after rarefaction from what we get from the entire dataset (because it is averaged and lucky/unlucky draws all make up for each other, converging to average values equal to the overall abundance %).
Furthermore, in our analysis, we produced a tree and then the UniFrac matrix (unweighted version) that was later used in the multivariate statistical analyses. But to date I can’t figure out, how to obtain a “rarefied dataset” from which we could produce a tree, then the UniFrac matrix and from that re-do our statistical analyses… Initially I had thought of extracting a reduced sequence set in order to have the same count for all samples (command sub.sample), but as Pat pointed out, this is not rarefaction but only a random draw. Should I use the subsample=T option of unifrac.unweighted to that purpose? The “reduced tree” will not be output but produced internally at each randomization (and I don’t really need it to be saved).