chimera.seqs

I’m having trouble figuring out exactly what to do with the chimera.seqs command. Specifically, what should the template file be? I tried using the same file that I used for the alignment (greengenes alignment database), but got the following ‘result’ that I’m not sure how to interpret:

F567RGU01DU0NL div: 89.2086 stDev: 24.6421 chimera flag: Your template does not include sequences that provide quantile values at distance 90
Observed 71.1538 57.6923 55.7692 51.9231 48.0769 57.6923 57.6923
Expected 31.4379 40.7902 41.9426 54.5486 76.4073 77.4367 77.4367


Can anyone offer some insight? Thanks!

I should have noted that this was using the Pintail method.

F567RGU01DU0NL div: 89.2086 stDev: 24.6421 chimera flag: Your template does not include sequences that provide quantile values at distance 90
Observed 71.1538 57.6923 55.7692 51.9231 48.0769 57.6923 57.6923
Expected 31.4379 40.7902 41.9426 54.5486 76.4073 77.4367 77.4367

The div:89.2086 means that the closest match in the template to sequence F567RGU01DU0NL had a similarity of 11%. This may indicate that your sequence is backwards. You could try using the reverse.seqs command to reverse the sequence and run it again. In order to determine whether a sequence is chimeric the pintail algorithm compares the observed scores to expected scores over various windows across the sequence. To give meaning to the scores generated, we look at the scores generated by known non-chimeric sequences(the template sequences) with the same distances from each other. If your sequence scores are above the 95% of the scores we get from the template sequences, we report the sequence as chimeric. The “Your template does not include sequences that provide quantile values at distance 90” means that the template did not have any sequences that had distances of .90 from each other. Distances are rounded up, so .892086 becomes .90.

Oops… I guess I was using a fasta file that had gone through ‘filter.seqs’, while my template still had all of the gap positions - so that explains the low similarity.

I have the same result.

F7767RN04I92TX div: 84.7826 stDev: 84.7826 chimera flag: Your template does not include sequences that provide quantile value
s at distance 85
Observed 83.3333 79.1667 77.0833 85.4167 87.5 91.6667
Expected 84.7826

Does the chimera check (Pintail) need query to be full length of 16s rRNA? Current titanic 454 reads are only about 400 nt.

Are your query sequences aligned and the same length as your template sequence? The results indicate a very bad match between your query sequence and the closest sequence in your template.

You should be able to filter your sequences and then take the .filter file and apply that (filter.seqs(hard=.filter)) to the template file. Then give both to mothur.

Thanks, I did the hard filter and got the chimera check works. Thanks~

So, the steps I used

  1. align to template
  2. screen the alignment if needed
  3. filter alignment result. This step will generate *.filter file.
  4. hard filter the template by *.filter file from step 3
  5. chimera.seqs(fasta=data.filter.fasta, template=silva.bacteria.filter.fasta,method=pintail)

hi,
I’ve done a unique.seqs, align.seqs, and filter.seqs on my pyrosequencing dataset. I’m left with 70 5012 sequences in my dataset and trying to run chimera.seqs method=bellerophon. I’ve been doing a few test runs with this command (using only 1000 sequences) and get the following output:

Sequence with preference score above 1.0: 389
Minimum: 0.583426
2.5%-tile: 0.705179
25%-tile: 0.815585
Median: 0.926245
75%-tile: 1.09884
97.5%-tile: 1.97767
Maximum: 2.77009

This seems odd that 389/1000 sequences have a suspected chimera…also I would need much more than 16GB of memory to run my whole dataset. Can anyone provide me with some insight?
thanks!

This may be the reason that bellerophon is not widely used at this point - a lot of false positives. I suspect the really high values are probably chimeras. I’d suggest moving on to pintail.

Alternatively, we’ve almost finished implementing a new method, ChimeraSlayer, from the folks at the Broad Institute. They seem to have very good results with this (much better than the other methods) with long and short reads. Also, I know that people are working on developing better methods to detect chimeras in pyrotag data (yes, they’re in there) that would take into account the abundance of a sequence - the dominant sequence will probably not be a chimera.

You might also consider increasing the cutoff definition for a chimera. The implementation of Bellerophon 3 on the GreenGenes server uses a divergence ratio threshhold of 1.1. Using this cutoff, for instance, my number of chimeric sequences in one sample dropped from 238 to 124 out of 567 initial clone sequences. In other cases, it salvaged fewer seqeunces – from 111 down to 94 out of 346. But I would agree, these still seem like high percentages.

My experience with ChimeraSlayer so far is that it is fairly ram-hungry, godawful slow with default settings, and not multi-threaded. On a 4GB Win32 system, it will slowly rise to 1,966,624kb of RAM usage (the maximum allocation under Win32) and then print “Out of memory.” Mind you, this is the PERL interpreter saying this, not the program itself – which hums along as if everything is still copacetic. Doing what? I have no idea.

On a 16GB 64-bit linux box, the program will grab 5.1GB of RAM and hold it for the duration of its multi-hour run (vs the gg core set) – at the rate of 1 sequence analyzed every 9 minutes. That’s minutes per sequence, not sequences per minute – completely unusable on larger datasets. My hat’s off to Pat and Sarah if they can make it fly!

To add insult to injury, to use ChimeraSlayer most effectively, one really should run it twice. Once against a database, and again against the input sequences themselves since a great deal of sequence diversity simply isn’t in any database yet.

I’m looking forward to seeing chimer-checking methods that take abundance into account. It does seem natural to think that more abundant sequences would more likely contribute to chimera formation.

Robin

I just had a thought (yes, yes, I know)…

If the exact same sequence is seen more than once, it really can’t be a chimera, can it? I mean, one really doesn’t expect the polymerase to fall off at exactly the same place over and over again, right? And because we typically use uniq.seqs before chimera.seqs, we may again be falsely boosting the positive count. But if the SNP density were low, I guess it would be impossible to pinpoint the precise breakpoint, so maybe throwing all the babies out with the bathwater is the right way to go. Or perhaps use a threshhold. If a I see a sequence 5 times, it’s not chimeric?

Robin

Hi,
I was trying to use the chimera.pintail command as in the mothur tutorial (mothur > chimera.pintail(fasta=filename.align, template=templatename.fasta), but it kept telling me it was an invalid command. When I used “mothur > chimera.seqs(fasta=filename.align, template=templatename=fasta, method=pintail)” the chimera check proceeded. Perhaps it would be good to note this in the tutorial?
Thanks,
Katy

Katy - we actually changed the command names in the last release. Can you double check that you have v.1.9? It should be chimera.pintail now.

Yes, it looks like I did have an old version of Mothur. Thank you for your help, I would have been stuck on these chimera checks for weeks.