question on classify.seqs using knn algorithm

Hello all,

I was wondering how distances are calculated with the “search=distance” method for finding k-nearest neighbors; are reported values based on pair-wise alignments after excluding terminal gaps (assuming template sequences are longer than the queries)?

Thanks much.
pad

The distance reported is the distance calculated between the query sequence and its closest match in the template. For example:

SequenceA ATGCATGCATGC
SequenceB ACGC---CATCC

These two sequences have two mismatches and one gap. The shorter sequence is counted as 10 nt long (9 bases plus the gap), because a gap of any length is treated as a single position. The distance is therefore 3/10, or 0.30. This is the distance calculation method employed by Sogin et al. (1995). The reasoning behind this penalty is that a gap represents an insertion, and a gap of any length most likely results from a single insertion event.
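If it helps to see the arithmetic spelled out, here is a rough Python sketch of that one-gap counting scheme. It is only an illustration of the rule described above, not mothur's actual code: the function name `onegap_distance` and the choice of gap characters are my own assumptions, and it does not treat terminal gaps specially, so it does not settle how they are handled.

```python
def onegap_distance(a: str, b: str, gap_chars: str = "-.") -> float:
    """Illustrative one-gap distance for two aligned sequences.

    Assumptions (not taken from mothur's source): each run of gap columns
    counts as one difference and one position, and terminal gaps are
    treated like any other gap.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")

    diffs = 0        # mismatches plus gap runs
    positions = 0    # compared positions, each gap run counted once
    in_gap = False   # True while we are inside a run of gap columns

    for x, y in zip(a.upper(), b.upper()):
        x_gap = x in gap_chars
        y_gap = y in gap_chars
        if x_gap and y_gap:
            continue  # neither sequence has a base here; skip the column
        if x_gap or y_gap:
            if not in_gap:
                # first column of a gap run: count the whole run once
                diffs += 1
                positions += 1
                in_gap = True
            continue
        # both columns hold a base
        in_gap = False
        positions += 1
        if x != y:
            diffs += 1

    if positions == 0:
        raise ValueError("no comparable positions")
    return diffs / positions


print(onegap_distance("ATGCATGCATGC", "ACGC---CATCC"))
```

For the two sequences in the example, this counts 2 mismatches plus 1 gap run over 10 positions, which gives the same 3/10 as above.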