Remove.rare clarification

Good morning mothur maintainers,

I am trying to remove “spurious” reads from a dataset’s .shared file (I acknowledge the arguments for/against this, reviewers gonna review) using the remove.rare command.

I am finding that when using the bygroup=f parameter, individual groups within a rare OTU’s column can still contain read counts below the desired nseqs threshold, even though the sum of reads across all samples for that rare OTU is comfortably above the desired nseqs.

For example, using the full Stability dataset, I run the command:

remove.rare(shared=final.opti_mcc.shared, nseqs=6, bygroup=f)

My resulting final.opti_mcc.0.03.pick.shared file still has read counts of 1 for OTUs that I expected to be removed. Yet when I sum all of those counts across samples, the sum is 7, which is above the nseqs=6 threshold.

My main question is: Is this the intended behavior of the default bygroup=f parameter, i.e., to retain only OTUs whose counts sum to at least nseqs across all groups?

When using bygroup=t, the resulting pick.shared file indeed only keeps the per-sample read counts above that threshold. This is roughly what I want; however, I then get into a great internal philosophical debate over “well, are reads below that threshold truly ‘spurious’ if they could be deemed spurious in one group and not another?” I end up spending time on some (I think) clever steps: list.otus on the original and “cleaned” shared files, a grep -v command to isolate the “bad” OTUs from the original shared file, and a remove.otus command to remove those OTUs that are truly below the threshold. I find all of that a satisfying challenge, but I often fail to describe it in publication-quality wording and/or to rationalize it to a potential reviewer.
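In sketch form, the comparison step I have in mind looks something like this (toy data and invented filenames; list.otus writes .otulabels files, and the OTU names here are made up for illustration):

```shell
# Hypothetical stand-ins for the two list.otus outputs (.otulabels files):
printf 'Otu001\nOtu002\nOtu003\n' > original.otulabels   # all OTUs in the original shared file
printf 'Otu001\nOtu003\n'         > cleaned.otulabels    # OTUs surviving the remove.rare step

# OTUs present in the original list but absent from the cleaned list
# (-x matches whole lines, so Otu001 doesn't also match Otu0011):
grep -v -x -f cleaned.otulabels original.otulabels > bad.otulabels
cat bad.otulabels
```

The resulting bad.otulabels could then be handed to remove.otus via its accnos parameter.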

I do hope that my question/line of thinking makes sense here.

Thank you for your time,
Elek

It sounds like remove.rare is doing what it’s supposed to. If bygroup=F, then it looks across all samples and removes any OTU whose total abundance across all samples is below the threshold. If bygroup=T, then it looks at the OTUs within each sample and sets them to zero if they are below the threshold.
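To make the two behaviors concrete, here is a toy illustration in plain awk on a made-up, simplified shared-style table (the file, counts, and OTU names are invented for the example, not mothur output):

```shell
# Toy shared-style table: one row per group, one column per OTU.
cat > toy.shared <<'EOF'
Group Otu001 Otu002 Otu003
A 5 1 0
B 4 2 9
C 0 1 8
EOF

# bygroup=f logic: keep an OTU only if its TOTAL across all groups >= nseqs.
awk -v nseqs=6 '
NR==1 { for (i=2; i<=NF; i++) name[i]=$i; next }
      { for (i=2; i<=NF; i++) sum[i]+=$i }
END   { for (i=2; i<=NF; i++) if (sum[i] >= nseqs) print name[i], sum[i] }
' toy.shared
# Otu002 (total 1+2+1 = 4 < 6) is dropped, while Otu001 (total 9) survives
# even though its per-group counts of 5, 4, and 0 are all below 6.

# bygroup=t logic: zero out any individual cell below nseqs.
awk -v nseqs=6 '
NR==1 { print; next }
      { for (i=2; i<=NF; i++) if ($i < nseqs) $i=0; print }
' toy.shared
```

This is why, with bygroup=f, per-sample counts of 1 can legitimately remain in the pick.shared file.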

If you want to get more existential (and perhaps invest more in pushing back against the reviewers), then you probably should be running this on the sequences before doing all of the clustering and whatnot. The justification for removing rare/spurious sequences is that they are sequencing artifacts. So, I’d remove the rare sequences, not the rare OTUs. It might not make a big difference for a threshold of 2, but it would for higher levels. Then there’s the question of what threshold you use and why. A singleton sequence from a sample with 10^3 sequences would likely have an abundance of ~10 in a sample with 10^4 sequences. Yet the same threshold gets applied across all samples, regardless of their sequencing depth. If you want something to cite to a reviewer, you might give them this preprint, which the reviewers didn’t like because I used rarefaction. Three rarefaction papers later, I’m ready to get back to the rare sequence paper :slight_smile:

Pat


Thank you for the reply, Dr. Schloss. Lots of food for thought!

As for removing rare sequences prior to clustering, do you propose removing them before the pre.cluster step or before the cluster step? I would tend towards before pre.cluster, but perhaps that would be too “harsh” on some reads that merely look like a SNP due to slight sequencing error.

As for actually performing this removal, would one just use the count_table from before the intended clustering step to see which sequences are rare, remove.seqs them from the .fasta and .count_table, and then proceed with the clustering step?
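What I picture, roughly (toy count_table with invented sequence names and counts; in a real mothur count_table, column 2 holds the total across groups):

```shell
# Toy count_table (made-up data); column 2 is the total across groups.
cat > toy.count_table <<'EOF'
Representative_Sequence total A B
seq1 10 6 4
seq2 2 1 1
seq3 7 0 7
EOF

# Sequences whose total abundance falls below nseqs, as an accnos-style list:
awk -v nseqs=3 'NR > 1 && $2 < nseqs { print $1 }' toy.count_table > rare.accnos
cat rare.accnos
```

The rare.accnos list could then feed remove.seqs via its accnos parameter, alongside the .fasta and .count_table, before clustering.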

Thank you for your thoughts and time,
Elek

If I had to do it, I’d do it before cluster/cluster.split.

Pat

Or perhaps prior to the dist.seqs step to avoid calculating distances between sequences that we’re just going to throw away anyway?

Thanks again Pat,
Elek

Yep, before the distances, whether it’s with dist.seqs or cluster.split.
Pat

