Thank you in advance for always responding to my posts without unduly calling attention to my lack of mathematical/statistical/programming knowledge…
With that in mind, I’m wondering if this would be appropriate in subsampling groups? In one dataset I have 26 samples from 8 reactors which were inoculated from different sources (2 uninoculated, 2 inoculated from source A, 4 inoculated from source B). The number of sequences per sample ranges from about 4,000 to 40,000. I ran the 454 SOP without subsampling, and after NMDS and PCoA I looked to see which OTUs were statistically significant in describing variance among samples. Most of these were my most abundant OTUs, but there were several rare OTUs that also showed significance. Going back to my taxonomy file, the phylogenies of these OTUs make sense in terms of explaining variance. For instance, the OTUs that explained the differences between reactors inoculated from A and those inoculated from B are sequences we would expect to be in the A inoculum or the B inoculum, but not in both.
Ok, so then I go back and subsample and I lose some of these OTUs, which is understandable when you’re pulling 4,000 sequences out of 40,000. If I lose an OTU that only has 1 or 2 representatives, I’m not as concerned, because these do not significantly contribute to variance among the samples. However, when I lose OTUs that have 100 or so sequences, I am losing OTUs that do significantly contribute to the variance. So I was wondering if there is a way to subsample more than once and then create a representative subsample from these iterations? I feel like this may bias the sample even more towards abundant sequences, though.
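To make the idea concrete, here is a rough Python sketch of repeated subsampling followed by averaging. It is only an illustration under stated assumptions, not how any existing tool does it: the function names (`subsample_once`, `mean_subsample`), the OTU labels, and the toy counts are all made up, and sequences are drawn without replacement using only the standard library.

```python
import random
from collections import Counter


def subsample_once(pool, depth, rng):
    """Draw `depth` sequences without replacement; return counts per OTU."""
    return Counter(rng.sample(pool, depth))


def mean_subsample(otu_counts, depth, iterations=100, seed=1):
    """Average per-OTU counts over repeated subsamples (hypothetical helper)."""
    # Expand the count table into one label per sequence.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of sequences in the sample")
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    totals = Counter()
    for _ in range(iterations):
        totals.update(subsample_once(pool, depth, rng))  # adds counts
    # OTUs that were never drawn get an average of 0.0.
    return {otu: totals[otu] / iterations for otu in otu_counts}


# Toy sample (invented numbers): 40,000 sequences in total,
# including one OTU with 100 sequences and one singleton.
sample = {"Otu001": 30000, "Otu002": 9899, "Otu003": 100, "Otu004": 1}
averaged = mean_subsample(sample, depth=4000, iterations=100)

for otu, mean_count in averaged.items():
    print(otu, round(mean_count, 2))
```

One thing the sketch suggests: with enough iterations, the averaged count for each OTU should approach its original proportion multiplied by the subsample depth (roughly 10 for a 100-sequence OTU when pulling 4,000 out of 40,000). So the averaged table would look much like simply scaling the original counts, and the averages would generally be fractional rather than whole numbers.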
Maybe this is less of a feature request and more the confused mutterings of a wannabe bioinformaticist.