reference sequence alignment

This is a general consideration on the quality of reference guided sequence alignments. I am seeing some wild behaviour in alignments that -from my point of view- is understandable, but makes me question if that is the best method.

My point is that, when one uses a reference alignment, and relies on a database for that, it may happen that distances become artificially too large. Let me explain: if I am doing taxonomical analysis of a soil, I cannot predict easily in advance which range of microorganisms I’ll find. This implies I need to check against a wide range of potential aligned targets, e.g. an alignment of all bacterial 16S sequences.

Now, to build such an alignment database one has to match very divergent sequences, hence resulting in, fully justifiably, many indels, some of them potentially very long. If one where to make an alignment of closely related 16S sequences one would then naturally expect less and shorter indels.

So, when I come up with my soil sample, of which I know little in advance, I must start by building an alignment. If I use a general reference alignment of 16S sequences, then I can expect: 1) that my sequence is aligned against a reference with many and long gaps, and 2) that chances that the reference selected to guide my sequence alignment is the wrong one increase (more mutations, more chances that on building the alignment it may match more potential targets).

Hence, one would expect that reference guided alignments may introduce too many gaps (which might be dealt with by filtering out common gaps except for: ) and more misaligned sequences (which would force retaining wrong gaps in the filtering process), hence resulting in artificially bigger distances.

Does this make any sense at all? Might that influence the subsequent analysis result with mothur?

All subsequent analysis will be affected by the quality of the alignment. Please correct me if I am wrong. Here is my recent experience.

While analyzing my synthetic community mix I aligned the known full length sequences using SILVA and then used that alignment file as a reference to align my pyro-tag sequencing for the synthetic community. I noticed that the quality of alignment (as indicated by insertions and similarity of modified query sequence and template after alignment) was much better compared to when I was using the seed alignment that is provided to align the pyro-tags. Also, when I look at each OTU generated from an alignment file that consists of pyro-tags + chopped (to similar length as pyro-tags) synthetic community sequences together (aligned against SILVA alignment of full length sequences of known synthetic community members), a larger portion of sequences seem to be binned with the known synthetic community sequences. This is compared to if the same library (consisting of pyro-tags and chopped synthetic community sequences) was aligned to the provided seed database.

I wish there was a way to add sequences to the reference alignment provided. Is there a way to do that, so as to create a personalized reference template? We have many good quality full length sanger sequences from past clone libraries of environments that are currently being subject to pyro-tag sequencing and it would be nice to be able to add them to the provided seed alignment template. This would require first checking if the sequences to be added to the alignment file (or a close relative) is not already part of the template. But I am not sure what the next steps would be.

Ameet - Yes, that’s the whole point. There is nothing mysterious about the reference alignment. It’s just a SILVA-formatted fasta file. You can open it and concatenate any sequences you want to it. This is something you can’t do with the SILVA, gg, or RDP aligners.


I’d encourage you to go back and re-read the aligner paper we published a year or two ago. There are a couple of basic steps to the process. First, we find the closest reference sequence by kmer searching - this is alignment independent. Second, we do a pairwise alignment of the query to the reference sequence. Third, we re-introduce the gaps that were originally in the reference to the query so the alignments are the same length. So, I wouldn’t worry about introducing too many artificial insertions. In fact, the vertical-filtered alignments have fewer gaps than using something like MUSCLE or CLUSTALW. If you are getting wonky alignments you have several options. You could improve your reference by supplementing the references with well-aligned sequences similar to what you are interested in and having difficulties aligning. You could also fiddle with the kmer size. You could also open the poorly aligned sequences in ARB and correct the alignment.

Hope this helps,

Thanks very much Pat.

Pat, thanks for the feedback. I do understand the underlying reasoning, but it still leaves me with an uneasy feeling.

When one has many thousands of unknown pyrosequencing sequences it easily becomes unfeasible to hand correct and alignment. I guess I will have to play with k-mer size and look for a better (more restricted) reference dataset, but then again it is difficult to be sure one is not introducing artifacts when selecting it.

In fact, I am working with 16S and using Silva as a reference. What I see is that mothur includes more gaps (even with preclustering and filtering), both with short V6 and nearly full length sequences, which might be a dataset related problem.

That is puzzling us, as we expected better alignments. I have made some checks and it certainly seems to depend on the dataset. Next I will try changing the kmer size with a controllled mock dataset. If you have any more suggestions, they’ll be most welcome.