I was wondering how distances are calculated with the “search=distance” method for finding k-nearest neighbors; are reported values based on pair-wise alignments after excluding terminal gaps (assuming template sequences are longer than the queries)?
The distance reported is the distance calculated between the query sequence and its closest match in the template. For example:
This pair would have two mismatches and one gap. The length of the shorter sequence is 10 nt, since the gap is counted as a single position. The distance is therefore 3/10, or 0.30. This is the distance calculation method employed by Sogin et al. (1995). The logic behind this type of penalty is that a gap represents an insertion, and a gap of any length most likely represents a single insertion event.
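To make the arithmetic concrete, here is a minimal Python sketch of this kind of "one gap" distance. It is not the program's actual code. The function name `one_gap_distance` and the two aligned sequences are made up for illustration, and the sketch assumes the pair has already been trimmed to the overlapping region, so terminal gaps are not handled here. It also assumes the denominator is the number of compared positions, with each run of consecutive gaps collapsed to a single position.

```python
def one_gap_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Distance between two aligned sequences, counting each gap run once.

    Assumption: both strings come from the same pairwise alignment and
    have already been trimmed to their overlapping region.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    mismatches = 0   # columns where both have a base and the bases differ
    gap_events = 0   # runs of consecutive gap columns, each counted once
    positions = 0    # compared positions, with each gap run counted as one
    gap_side = None  # which sequence the current gap run is in, if any

    for a, b in zip(seq_a, seq_b):
        a_gap, b_gap = a == gap, b == gap
        if a_gap and b_gap:
            continue  # a column that is a gap in both carries no information
        if a_gap or b_gap:
            side = "a" if a_gap else "b"
            if gap_side != side:
                # start of a new gap run: one event, one position
                gap_events += 1
                positions += 1
                gap_side = side
            continue
        gap_side = None
        positions += 1
        if a != b:
            mismatches += 1

    if positions == 0:
        raise ValueError("no comparable positions")
    return (mismatches + gap_events) / positions


# Illustrative pair (hypothetical): two mismatches and one 3-nt gap.
query = "ACGC---CATCC"
template = "ATGCATGCATGC"
print(one_gap_distance(query, template))  # 0.3
```

With this illustrative pair, the three gap columns count as one difference and one position, giving (2 + 1) / 10 = 0.30, the same arithmetic as described above.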