I’d like to generate the majority consensus taxonomy using classify.otu with the groups option, but only for OTUs that appear a minimum number of times. For example, I would only work with OTUs that appear at least 5 times in at least one group, or only classify the top 100 OTUs. Is there a way to subsample the list, name, and groups files used in classify.otu based on abundance, or to use a subsampled shared file?
I think removing rare OTUs is a bad idea. You will alter the distribution of the communities, which will disproportionately affect the samples you have more reads from. Furthermore, because most of the metrics depend on that distribution, you will be introducing a bias into your analysis. If you really want to do this, I would suggest using an R script to split out the OTUs.
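For anyone who still wants to try the filtering, here is a minimal sketch of the idea, written in Python rather than R. It assumes the usual tab-delimited shared file layout (label, Group, numOtus, then one column per OTU). The function name `filter_shared` and its parameters are made up for illustration and are not part of mothur. The caveat above still applies: the filtered table no longer reflects the original community distribution.

```python
import csv
import io


def filter_shared(text, min_count=5, top_n=None):
    """Keep OTU columns that reach min_count in at least one group.

    text is the content of a shared file, assumed to have the columns
    label, Group, numOtus, followed by one column per OTU.
    If top_n is given, only the top_n most abundant of those OTUs
    (by total count across groups) are kept.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, data = rows[0], rows[1:]
    otu_cols = range(3, len(header))

    # An OTU passes if its largest per-group count is at least min_count.
    keep = [i for i in otu_cols if max(int(r[i]) for r in data) >= min_count]

    if top_n is not None:
        totals = {i: sum(int(r[i]) for r in data) for i in keep}
        ranked = sorted(keep, key=lambda i: -totals[i])[:top_n]
        keep = sorted(ranked)  # restore the original column order

    out = [header[:3] + [header[i] for i in keep]]
    for r in data:
        # numOtus is rewritten so it matches the reduced table.
        out.append(r[:2] + [str(len(keep))] + [r[i] for i in keep])
    return "\n".join("\t".join(r) for r in out)


if __name__ == "__main__":
    # Hypothetical toy table, not real data.
    example = (
        "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003\n"
        "0.03\tA\t3\t10\t1\t4\n"
        "0.03\tB\t3\t0\t2\t7\n"
    )
    print(filter_shared(example, min_count=5))
```

The OTU labels left in the header row are the ones you would then carry forward; how you restrict the list, name, and groups files to those OTUs is a separate step that this sketch does not cover.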
I have one sample that seems really strange: one replicate has >800 OTUs (every other sample has <400). Most of these 800 OTUs are singletons. The number of sequences is relatively evenly distributed between samples. It’s likely a sampling artifact (very low biomass samples, physical replicates difficult to collect). But I’m having trouble understanding the Good’s coverage results: if this high number of OTUs really does represent the diversity of the environment, shouldn’t I expect lower coverage values for the other samples with much lower numbers of OTUs? Coverage values range from 91% (the sample with >800 OTUs) to 98% (most of the other samples). If coverage estimates what percent of the total species is represented in a sample, with Coverage = 1 - (number of OTUs sampled once / total number of individuals (OTUs?)), shouldn’t samples with a higher number of singleton OTUs have a higher coverage? I guess the total number of OTUs is increased relative to singleton OTUs in that sample as well. Chao1 is also very high, which makes sense since a high proportion of singletons will inflate Chao1. But if the number of unsampled species in cases of undersampling can be roughly estimated using the Chao1 estimator, does this imply that a high proportion of singletons is an indicator that the environment was undersampled? This seems to make intuitive sense; however, in practice, shouldn’t the physical replicate that only has 313 OTUs have a much lower coverage and be obviously undersampled in comparison? Is there a way to calculate alpha diversity statistics taking paired replicates into account?
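To make the two formulas in the question concrete, here is a small Python sketch. It takes the denominator of Good's coverage to be the total number of sequences (individuals) in the sample, not the number of OTUs, and it uses the bias-corrected form of Chao1; whether your software uses exactly these forms is an assumption worth checking. The two count vectors are invented toy data, chosen only so that both samples have the same number of sequences.

```python
def goods_coverage(counts):
    """Good's coverage: 1 - (singleton OTUs / total sequences)."""
    counts = [c for c in counts if c > 0]
    singletons = sum(1 for c in counts if c == 1)
    return 1 - singletons / sum(counts)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + n1 * (n1 - 1) / (2 * (n2 + 1))."""
    counts = [c for c in counts if c > 0]
    n1 = sum(1 for c in counts if c == 1)  # singleton OTUs
    n2 = sum(1 for c in counts if c == 2)  # doubleton OTUs
    return len(counts) + n1 * (n1 - 1) / (2 * (n2 + 1))


if __name__ == "__main__":
    # Hypothetical samples, each with 1000 sequences in total.
    many_singletons = [1] * 90 + [2] * 5 + [90] * 10  # 105 OTUs
    few_singletons = [1] * 20 + [2] * 5 + [97] * 10   # 35 OTUs

    for name, sample in [("many", many_singletons), ("few", few_singletons)]:
        print(name, len(sample), round(goods_coverage(sample), 2),
              round(chao1(sample), 1))
```

With these toy counts, the sample with more singletons gets the lower coverage (about 0.91 versus 0.98) and the higher Chao1, because at equal sequencing depth each extra singleton raises the numerator of the coverage formula without changing the denominator.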