Dealing with chimeras

Hi, I’ve been hesitating to ask these questions for a while now.

As I get different results using different algorithms, I was wondering what’s the right - wait! - most correct way to deal with this? Not only does the number of apparent chimeras differ, the overlap is generally not very high either (few of the same sequences are flagged as chimeric by different algorithms). So combining the outputs increases the number of sequences to be culled.

Perhaps I’m too picky (can this even be?), but investigating diversity in quasi-pristine, remote areas, I’d like to do it as correctly as possible. I don’t want to inflate the diversity (or abundances!), but I don’t want to lose (too much) information, either. Removing low-abundance sequences is not an option (and indeed, chimeras do not necessarily have low abundances).

Testing Uchime (de novo + Silva), Perseus and Decipher results in the following:
Uchime: 471
Silva: 284
Perseus: 577
Decipher: 666

Combining them yields the following numbers of unique chimeras (I didn’t do all combinations, but you get the idea):
U+S: 661
U+P: 701
U+D: 978
U+P+D: 1151
D+S: 867
S+U+P: 865
S+U+P+D: 1304

out of 6126 unique seqs. I didn’t try Chimera Slayer either. I still have to test the impact of these combinations on the total retained sequences and OTU abundances.
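
To put a number on "the overlap is not too high", the shared calls can be backed out of the counts above by inclusion-exclusion. This is only a rough sketch: it assumes U is Uchime de novo, S is Uchime against Silva, P is Perseus and D is Decipher, and that each combined figure is the size of the union of the individual calls. The variable names are made up for illustration.

```python
# Counts reported above: sequences flagged per method and per combined pair.
flagged = {"U": 471, "S": 284, "P": 577, "D": 666}
union = {("U", "S"): 661, ("U", "P"): 701, ("U", "D"): 978, ("D", "S"): 867}
total_unique = 6126
all_four = 1304  # S+U+P+D combined

def pairwise_overlap(a, b):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return flagged[a] + flagged[b] - union[(a, b)]

overlaps = {pair: pairwise_overlap(*pair) for pair in union}

for (a, b), shared in overlaps.items():
    # Jaccard index: shared calls as a fraction of all calls by either method.
    jaccard = shared / union[(a, b)]
    print(f"{a}+{b}: {shared} shared calls, Jaccard = {jaccard:.2f}")

# Fraction of the data set that would be culled using all four methods.
print(f"All four combined: {all_four / total_unique:.1%} of unique sequences")
```

Under those assumptions U and S agree on only 94 sequences, and D and S on 83, which is what makes the combined totals grow so quickly.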

So as chimeras are such a problem, isn’t even more scrutiny needed? Tests on mock/artificial (digital) communities that remove up to 90 % of chimeras are one thing, but isn’t this huge difference in positives worrying?
After going through all the trouble of denoising, removing low-quality seqs, …, the choice of a chimera detection algorithm can still dramatically impact your results, it appears.

  1. How many and which algorithms to combine? Where does it end?
  2. Is checking the positives manually, to avoid removing too many false positives, an option? How would one do this?

As I dived into pyrosequencing without any previous hands-on experience with Sanger sequences, manual aligning, chimera checking, …, I am somewhat handicapped: I don’t have the insights most of you have, which limits my view on things. So if I’m wrong, please correct me.

Do any of you have experience with DECIPHER? Combining Uchime and Decipher should remove 89 % of all chimeras (according to their datasets, of course).

And while we’re at it, you (Pat) still advise checking the alignment manually. Can you recommend a program that is able to deal with this many sequences (and MB/GB)? BioEdit, ClustalX and MEGA all crash …


Thanks for any input! :geek:

Oh, and classifying these positives (just testing to see if I could detect some trends) yielded bootstrap values of 0-100. The high values could of course belong to chimeras formed from closely related sequences. Perhaps manually checking these positives should focus on the sequences with high bootstrap values? But again, where to draw the line (80 %? 90 %?).

thanks

Hey Kirk,

I feel your pain. I would stick with Uchime, ChimeraSlayer and Perseus. Based on our testing and the original authors’, they really are best when used de novo. I haven’t been too impressed with DECIPHER because of its database dependence and questionable performance with short sequences. The key is the tradeoff between false positives and false negatives. We can change the settings on all of these to remove all of the chimeras, but we’ll of course lose other good stuff as well. I think the best thing would be to use multiple methods on a vetted set of non-chimeras and see how the false positive rate goes up as you add more methods and then go from there. I think this approach would be better than using mock communities. Maybe I should do this…
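
The vetted-set test could be tallied roughly like this. It is a sketch only: it assumes each method’s flagged sequence IDs have already been parsed into a Python set, that every sequence in the vetted set really is non-chimeric, and that combining methods means pooling their calls. The function name and the toy IDs are hypothetical.

```python
from itertools import combinations

def false_positive_rates(flags_by_method, n_vetted):
    """Return {method combination: fraction of vetted sequences flagged}.

    flags_by_method maps a method name to the set of sequence IDs it flagged.
    Every vetted sequence is assumed non-chimeric, so any flag counts as a
    false positive. Combining methods means taking the union of their calls.
    """
    rates = {}
    methods = sorted(flags_by_method)
    for k in range(1, len(methods) + 1):
        for combo in combinations(methods, k):
            flagged = set().union(*(flags_by_method[m] for m in combo))
            rates[combo] = len(flagged) / n_vetted
    return rates

# Hypothetical toy data: 1000 vetted sequences and the IDs each method flagged.
toy_flags = {
    "uchime": {"seq1", "seq2", "seq3"},
    "perseus": {"seq2", "seq4"},
    "slayer": {"seq5"},
}
rates = false_positive_rates(toy_flags, n_vetted=1000)

# Print single methods first, then pairs, then all three together.
for combo, rate in sorted(rates.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("+".join(combo), f"{rate:.2%}")
```

Reading the output from single methods up to the full combination shows how much each added method costs in good sequences.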

Pat

Hi Pat,

Let me start off by openly disclosing that I am an author of the DECIPHER paper. Also, I want to mention that I admire the effort you have put into offering various chimera removal algorithms as part of mothur. Hats off to you.

Kirk,

I can fully appreciate your dilemma when deciding which chimera program to use, and just like Pat, I feel your pain. We created DECIPHER specifically to help mitigate the need to remove a lot of your sequence set just to catch a few real chimeras. If you have full-length sequences then you can use the fs option and the error rate is by far the lowest of any of the chimera programs. We calculated the false positive rate to be 0.1% of sequences using about 7,000 type-strains that we assumed to be non-chimeric.

The difficulty with the fs option is that it does not catch enough chimeras in short-length sequences (< 1,000 nt), which are common now because of next generation sequencing. If you have short sequences, my recommendation would be dependent on what type of chimera you are most concerned about. If you are afraid of chimeras formed from closely related sequences (< 10% distant 16S) then you should use Uchime based on our testing. In my opinion, Uchime is very similar to ChimeraSlayer algorithmically, but has a lower false positive rate and catches more chimeras. I am less familiar with Perseus because I have not tested it thoroughly yet, so I will withhold comment.

If you have short sequences but are most concerned about “bad” chimeras (with parents > 10% distant 16S) then you have several options. Uchime will still catch a fair number of these bad chimeras. The ss option with DECIPHER will catch most of these “bad” chimeras based on our testing. I recommend that you try using both of these programs, and then decide which result you prefer (both methods provide detailed outputs that you can look at more closely). Both programs have similar false positive rates (~1-2%); if you combine their results you will get a slightly higher false positive rate (~3%). Pat mentions that DECIPHER has “questionable performance with short sequences” and I am interested in what evidence he has in this regard. In our defense, I will say that several of our online users have gone out of their way to thank us for designing an algorithm that performs so well with short sequences.
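
As a back-of-the-envelope check on why pooling two ~1-2% methods lands near ~3%, the combined rate can be bracketed as below. This is an illustration, not how the figures above were measured: the 1.5% input is just the midpoint of the quoted range, and independence between the two methods’ errors is an assumption.

```python
def combined_fp_bounds(fp_a, fp_b):
    """Bracket the false positive rate when two methods' calls are pooled.

    lower: one method's false positives are a subset of the other's.
    independent: the two methods err independently of each other.
    upper: the two methods never flag the same good sequence (union bound).
    """
    lower = max(fp_a, fp_b)
    independent = 1 - (1 - fp_a) * (1 - fp_b)
    upper = min(1.0, fp_a + fp_b)
    return lower, independent, upper

# Illustrative input: 1.5% for each method (midpoint of the ~1-2% range).
lower, independent, upper = combined_fp_bounds(0.015, 0.015)
print(f"{lower:.2%} <= combined FP <= {upper:.2%}; if independent: {independent:.2%}")
```

With these inputs the independent case comes out just under 3%, so a pooled rate of about 3% suggests the two methods mostly make different mistakes.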

I can understand why Pat would struggle with what he calls our “database dependence” because I imagine it makes it more difficult to integrate our algorithm into mothur. In reality, all of these algorithms are database dependent because they require a set of (presumably) chimera-free reference sequences. The difference with DECIPHER is that the reference set is very large (~2 million sequences) in order to encompass all of the diversity that naturally exists. We tried to make DECIPHER more accessible by hosting a web-server so that users aren’t required to figure out how to install or run our program.

I hope this helped a little to guide you through the chimera-maze.

Best regards,
Erik Wright

:smiley:

> I can understand why Pat would struggle with what he calls our “database dependence” because I imagine it makes it more difficult to integrate our algorithm into mothur. In reality, all of these algorithms are database dependent because they require a set of (presumably) chimera-free reference sequences. The difference with DECIPHER is that the reference set is very large (~2 million sequences) in order to encompass all of the diversity that naturally exists. We tried to make DECIPHER more accessible by hosting a web-server so that users aren’t required to figure out how to install or run our program.

Eh, that’s not that big a problem for us - we’ve certainly implemented a number of other db-dependent methods (e.g. ChimeraSlayer, RDP’s Bayesian classifier, RDP’s old ChimeraCheck). Our experience has been that de novo chimera detection methods (e.g. ChimeraSlayer with our mods, UChime, Perseus) do better than the database-based methods. Also, we run into a number of people who use genes other than 16S, or who target things other than bacteria, for which there aren’t very good databases out there. I’ll have to give DECIPHER another look.

Thanks for the intro and if you have code you wouldn’t mind sharing we’d be happy to incorporate it into mothur!
Pat

Hi Pat,

Glad to hear that you are interested in incorporating DECIPHER’s Find Chimeras program into mothur. All of the source code is available from our website, and you are welcome to integrate it into your software. However, if you do not feel this is feasible then people can always use our web-server to check their sequences for chimeras:
http://DECIPHER.cee.wisc.edu/

Also, as you previously suggested, it would be excellent if someone more independent such as yourself went through the process of contrasting all of these different algorithms on various sets of test sequences/chimeras. We tried our best to do this without bias when publishing DECIPHER, but as Kirk pointed out, it was “according to [our] datasets, of course”. Although we also showed the results using other programs’ datasets, there is only so much we can do as authors to convince the reader.

Thanks,
Erik