How does align.seq's SimBtwnQuery&Template handle gaps?

Hello all -
I am currently using mothur to compare a set of environmental 16S sequences to database reference sequences. I’ve come to the conclusion that different methods can make quite important differences here. I think that aligning to a proper reference alignment (e.g., one built w/ secondary structure information) is critical to this.
I’ve run mothur’s align.seqs and am really interested in the SimBtwnQuery&Template column in the report results. The documentation says that this percentage “includes gaps” - however, I might be missing some key information on how it counts gaps occurring in the alignment comparison between the query sequence and the best hit template sequence -

  1. Does it count terminal gaps?
  2. Does it count a gap only once (comparable to ‘onegap’ in dist.seqs) or does it count each position of gaps (comparable to ‘eachgap’ in dist.seqs)?

Apologies if this information is already available somewhere and I have missed it.

Also wondering out of curiosity if anyone has any additional insight as to whether this column is trustworthy, or which of those options above are best for comparing environmental 16S sequences to databases. Personally I would choose to not count terminal gaps and count ‘onegap’ (each true indel is likely 1 evolutionary event), but I guess opinions can differ here.

And is there any way to approach this question with mothur besides the above? The other thing I was thinking of is pulling in all of the alignments of the best hits reported above with the environmental sequences, running dist.seqs and then pulling out just the relevant comparisons, but this approach seems super roundabout.


OK, I realized I could quickly test this questions with some simulated data:

  1. No, terminal gaps are not counted in this similarity.
  2. Gaps are counted according to their length (IE, the ‘eachgap’ method explained in dist.seqs).

I’ll keep this information out there in case someone else in the future needs it, or if there is anything else worth discussing from the above with the community.

This is also described with examples at the dist.seqs wiki page.


This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.