Dereplicate in chimera.uchime

In the SOP, there is a comment regarding chimera.uchime (with reference=self) that is strange to me.
“By default, if chimera.uchime calls a sequence as chimeric in one group, it considers it a chimera in all samples and will flag all for removal.”
I don’t see the logic behind this approach. The detection of chimeras by UCHIME is based on probability. A sequence is flagged as a chimera if potential parental sequences are found among the more abundant sequences in the same sample. Let’s call it seq1. However, if the same sequence exists in another sample without the corresponding parental sequences, that is evidence that this sequence is not a chimera. Let’s call it seq2. Therefore, not only should seq2 not be labelled as a chimera, but seq1 should not be labelled as a chimera either. Do you see my point?
In mothur, it seems to me that this approach cannot be applied. It is not possible to “unlabel” seq1 based on evidence that seq2 is not a chimera.
I would be curious to know the point of view of others on this. Would it be possible to apply this approach in mothur?
In advance, thanks for sharing your thoughts.

First off, this is something mothur is doing, not UCHIME. We apply UCHIME to each sample and flag sequences as appropriate. This generates a list of names of putative chimeras. The question is what to do with that list. The argument in favor of removing a flagged sequence from all samples (option a) is that, in a sample where it was not flagged, its parents might simply be rarer, so it’s a matter of “power”. Therefore, it would make sense to remove the sequence everywhere. The argument in favor of only removing flagged sequences from the samples where they’re flagged (option b) is that we aren’t making any extra assumptions. Your question is a very good one and introduces a third option (option c) - if it’s not flagged in every sample, it’s not a chimera. I’m not sure what to think about this and it is very provocative.
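
To make the three choices concrete, here is a minimal Python sketch. It is not mothur code: the function name `removal_plan` and the two input dictionaries are made up for illustration, and it assumes you have already collected, per sample, which sequence names are present and which ones UCHIME flagged. Option a removes a sequence everywhere if it is flagged anywhere, option b removes it only where it is flagged, and option c treats it as a chimera only if it is flagged in every sample that contains it.

```python
def removal_plan(present, flagged, option):
    """Return {sample: set of sequence names to remove} for one policy.

    present: {sample: set of sequence names observed in that sample}
    flagged: {sample: set of names flagged as chimeric in that sample}
    option:  "a", "b", or "c" (hypothetical labels, see the text above)
    """
    flagged_anywhere = set().union(*flagged.values())

    if option == "a":
        # Flagged in any sample -> removed from every sample.
        doomed = flagged_anywhere
    elif option == "b":
        # Removed only from the samples where it was flagged.
        return {s: seqs & flagged.get(s, set()) for s, seqs in present.items()}
    elif option == "c":
        # A chimera only if flagged in every sample that contains it.
        doomed = {
            name for name in flagged_anywhere
            if all(name in flagged.get(s, set())
                   for s, seqs in present.items() if name in seqs)
        }
    else:
        raise ValueError(f"unknown option: {option!r}")

    return {s: seqs & doomed for s, seqs in present.items()}
```

With seq1 flagged in sample A but present and unflagged in sample B, option a removes it from both samples, option b removes it from A only, and option c removes it from neither.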

Here are a couple of simulations someone should try…

  1. Take a bunch of mock community samples where we know the true chimeras. If options a, b, or c are used, what are the false negative and false positive rates?

  2. Take a very large dataset where a single sample was sequenced excessively. Call chimeras on the entire dataset - let these be the true chimeras. Then, randomly draw sets of 5000 reads from that dataset without replacement and run the subdatasets through UCHIME and flag the chimeras. Get the false negative and false positive rates for options a, b, and c.
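
For either simulation, the scoring step could look something like the sketch below. The function name `error_rates` and its arguments are hypothetical. It assumes you have the set of sequence names called chimeric under one option, the set of true chimeras, and the set of all sequence names considered.

```python
def error_rates(called, truth, universe):
    """Return (false positive rate, false negative rate).

    called:   names labelled chimeric under the option being tested
    truth:    names that are truly chimeric
    universe: every sequence name that was evaluated
    """
    universe = set(universe)
    called = set(called) & universe
    truth = set(truth) & universe

    tp = len(called & truth)          # chimeras correctly called
    fp = len(called - truth)          # good sequences called chimeric
    fn = len(truth - called)          # chimeras that were missed
    tn = len(universe) - tp - fp - fn

    # Guard against empty classes rather than dividing by zero.
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr
```

Running this once per option on the same data gives rates that can be compared directly.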

Sounds like a PLoS ONE paper…


Thanks for this motivating answer…
Before convincing a colleague of mine to do this work (and to put my name as last author on the PLoS One paper), I would be interested in testing the result of this approach on my current dataset. Would it be possible to do it with remove.seqs or some other commands in mothur?

Hey it’s better than J Crap Sci :wink:

remove.seqs would remove sequences that you define. To split up a file you might write a script in R/Python/Perl/Assembly to randomly select sequences using the count table or fasta/names file.
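
As a rough Python starting point for the random selection, something like this could work. The function name `subsample_counts` is made up, and it assumes you have already parsed your count table (or names file) into a dictionary of sequence name to read count. The parsing itself is left out here.

```python
import random
from collections import Counter


def subsample_counts(counts, n, seed=None):
    """Draw n reads without replacement from {sequence name: read count}.

    Returns a Counter mapping each name to the reads drawn for it.
    """
    # Expand to one entry per read so abundant sequences are drawn more often.
    reads = [name for name, k in counts.items() for _ in range(k)]
    if n > len(reads):
        raise ValueError(f"asked for {n} reads but only {len(reads)} exist")

    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return Counter(rng.sample(reads, n))
```

Calling it repeatedly with n=5000 and different seeds would give the replicate subsets described above.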