seq.error and chimera detection

Hi!

I am working on a 16S amplicon study where we have amongst others sequenced a mock community to test our method.

I’m using mothur to analyze the data and one of the functions I am using is seq.error(). To assess the error rates and degree of chimeras of the read data generated by PCRing the mock community, I compare the aligned reads (generated with align.seqs, with SILVA as the reference) against an alignment consisting solely of 16S sequences coming from taxa that are present in the mock (basically a subset of the SILVA alignment), with seq.error().

If I understand its chimera detection method properly, it constructs a “database” of all possible chimeras, and checks whether a read is more than 3 bp identical to a possible chimera than to a real sequence. Is this still the case?

Now, I have observed some strange behaviour. By accident, I had included a couple of incomplete 16S sequences in my mock reference alignment. As a result, >90% of the reported chimeras (numparents = 2 in the .error.chimera file) had reported the shortest incomplete 16S sequence as one of the parents of the chimera. In addition, the reported breakpointChi, was nearly always within 10 bp of the coordinate where that incomplete 16S sequence ended, with the majority of reported breakpointChi’s on exactly the coordinate of where that sequence ended.

Then, when I remove the shortest incomplete 16S sequence from the reference, the next shortest incomplete 16S was now nearly always reported as being one of the chimera parents, and the breakpointChi was reported around the region where thát sequence ended.

Finally, when I remove all incomplete 16S sequences from the alignment and/or replace them with complete 16S sequences, the number of reported chimeras drops significantly, but still, a significant fraction of the reported chimeras reports the shortest complete 16S sequence of the reference as one of the parents.

Thus, to me this sounds like that seq.error() reports many reads as chimera’s which may not necessarily be actual chimera’s.

Is this something you are aware of? And how should we go around this problem?

Joran

Sorry, I don’t quite follow what you are doing. seq.error was designed expecting your reference reads would overlap the region you were sequencing fully (not partially). Furthermore, the reference sequences should be the exact sequences you expect to see in your mock community - they wouldn’t necessarily be a subset of silva. These should be coming from Sanger sequence data or genome sequences.

Pat

I see what you mean. In this case, we are not sequencing a region within the 16S, but the end of the 16S gene (starting from a position ~500) with PacBio. I suppose this means that reads coming from longer 16S genes are not fully overlapped by shorter, but full length 16S sequences in the reference alignment. See my attempted drawing to clarify:

>Full length 16S of taxon A
################################################
>Read of taxon A
-----------------###############################
>Full length 16S of taxon B
############################################----

In this example, the read would have a good chance of being reported as a chimera with taxon B as one of its parents by seq.error().

So maybe the issue is that the chimera detector of seq.error() was not designed for this kind of data?

I assumed that SILVA would have the full length, exact 16S sequences of the genomes present in my mock, given that complete genome sequences are available for all my genomes in the mock (although potential contaminations could be present of course)

Joran