Good morning mothur maintainers,
I am trying to remove “spurious” reads from a dataset’s .shared file (I acknowledge the arguments for/against this, reviewers gonna review) using the remove.rare command.
I am finding that with the bygroup=f parameter, individual groups within the column of a rare OTU can still contain read counts below the desired nseqs threshold, as long as the sum of reads across all samples for that OTU is above nseqs.
For example, using the full Stability dataset, I run the command:
remove.rare(shared=final.opti_mcc.shared, nseqs=6, bygroup=f)
My resulting final.opti_mcc.0.03.pick.shared file still has read counts of 1 for OTUs that I expected to be removed. Yet when I sum all of the counts for such an OTU, the sum is 7, which is above the nseqs=6 threshold.
My main question is: Is this the desired effect of the default bygroup=f parameter? That is, does it retain only those OTUs whose counts, summed across all groups, are above the desired nseqs?
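To make the two readings concrete, here is a minimal Python sketch of the behaviour as I understand it. It is not mothur's code. The toy table and both function names are made up for illustration, and it assumes that "rare" means a count of nseqs or fewer, that bygroup=f tests each OTU's total across groups, and that bygroup=t tests each per-group count.

```python
# Toy OTU table: OTU label -> read counts per group (sample). Values are invented.
shared = {
    "Otu001": [120, 95, 210],
    "Otu002": [1, 1, 5],   # sums to 7, so above nseqs=6 in total
    "Otu003": [2, 0, 1],   # sums to 3, so rare in total
}

def remove_rare_total(table, nseqs):
    """Assumed bygroup=f reading: keep an OTU only if its total exceeds nseqs."""
    return {otu: counts for otu, counts in table.items() if sum(counts) > nseqs}

def remove_rare_bygroup(table, nseqs):
    """Assumed bygroup=t reading: zero every per-group count of nseqs or fewer,
    then drop OTUs that are left with no reads at all."""
    out = {}
    for otu, counts in table.items():
        kept = [c if c > nseqs else 0 for c in counts]
        if any(kept):
            out[otu] = kept
    return out

total = remove_rare_total(shared, 6)
bygroup = remove_rare_bygroup(shared, 6)
```

Under these assumptions, Otu002 survives the total-based filter with its counts of 1 intact, which matches what I see in my pick.shared file, while the per-group filter removes it entirely.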
When using bygroup=t, the resulting pick.shared file indeed keeps only the per-sample read counts above that threshold. This is close to what I want, but then I get into a great internal philosophical debate: are reads below that threshold truly “spurious” if they would be deemed spurious in one group and not in another? I then just waste time on a (I think) clever workaround: list.otus on the original and “cleaned” shared files, a grep -v command to isolate the “bad” OTUs from the original shared file, and a remove.otus command to remove those bad OTUs that are truly below the threshold. I find all of that a satisfying challenge, but I often fail to put it into publication-quality wording or to justify it to a potential reviewer.
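The comparison step of that workaround can be sketched in a few lines of Python. This is only an illustration: the function name and the example labels are hypothetical, and it assumes that an OTU missing from the “cleaned” label list is one that fell below the threshold in every group.

```python
def find_bad_otus(original_labels, cleaned_labels):
    """Return OTU labels present in the original list but absent from the
    cleaned list, preserving the original order (the role grep -v plays)."""
    cleaned = set(cleaned_labels)
    return [otu for otu in original_labels if otu not in cleaned]

# Invented example labels, standing in for the two list.otus outputs.
original = ["Otu001", "Otu002", "Otu003", "Otu004"]
cleaned = ["Otu001", "Otu004"]

bad = find_bad_otus(original, cleaned)
```

The resulting labels are the ones I would then hand to remove.otus against the original shared file, so that the surviving OTUs keep their original per-sample counts.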
I do hope that my question/line of thinking makes sense here.
Thank you for your time,
Elek