Pre.cluster diffs setting based on sequence length

Hello! I wanted to be certain that when running the pre.cluster command, I should consider the length of my sequences, as indicated in the SOP:
“We generally favor allowing 1 difference for every 100 bp of sequence”.

If my sequences have a mean length of 416 bp, should I set the diffs parameter to 4? I am confused because the pre.cluster command wiki explains that the maximum difference between sequences within a cluster can be double the value of this mismatch parameter.

Caveat emptor

Something to keep in mind is that when you set the number of mismatches to 2, you are allowing that the maximum difference between sequences within a cluster to be 4 (2 from the dominant sequence in one direction, and 2 in any other direction).

–> Do I then set diffs to 2 in order to account for the length of my sequences, allowing 1 bp difference per 100 bp?

Many thanks for all the help you provide and the thorough information in the forum!


I’d set diffs=4, which would give you clusters in which no two sequences can differ by more than about 2%.

The maximum number of diffs between seq1 and seq2 is 4, and between seq1 and seq3 it is also 4.
But seq3 is never directly compared to seq2, and the 4 diffs between seq1 and seq2 can be at completely different positions than the 4 diffs between seq1 and seq3.
So seq2 and seq3 could differ from each other by up to 8 diffs.
8/416 ≈ 2%
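
If it helps to see the arithmetic, here is a small toy sketch in Python. It is not mothur code and does not reproduce how pre.cluster works internally; the helper names (`hamming`, `mutate`) and the all-"A" toy sequences are made up purely for illustration. It only shows that two sequences that are each 4 mismatches away from the same dominant sequence can be 8 mismatches away from each other.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mutate(seq: str, positions, base: str = "C") -> str:
    """Return a copy of seq with `base` substituted at the given positions."""
    chars = list(seq)
    for p in positions:
        chars[p] = base
    return "".join(chars)


length = 416           # mean sequence length from the question
diffs = length // 100  # rule of thumb: 1 difference per 100 bp -> 4

# Toy sequences (hypothetical, for illustration only).
seq1 = "A" * length                           # the "dominant" sequence
seq2 = mutate(seq1, range(0, diffs))          # 4 mismatches at positions 0-3
seq3 = mutate(seq1, range(100, 100 + diffs))  # 4 mismatches at positions 100-103

d12 = hamming(seq1, seq2)  # seq2 vs the dominant sequence
d13 = hamming(seq1, seq3)  # seq3 vs the dominant sequence
d23 = hamming(seq2, seq3)  # seq2 vs seq3, never compared directly
worst_case_pct = 100 * d23 / length

print(d12, d13, d23, round(worst_case_pct, 2))  # 4 4 8 1.92
```

So with diffs=4 on ~416 bp reads, the worst case inside a cluster is 8 mismatches, or just under 2%.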


Many thanks for your answer and the walkthrough explaining the relation between the diffs parameter and the divergence between the sequences!
