Pre.cluster command

Hi, I just wanted to confirm with someone that my understanding of the pre.cluster step is correct:

Running pre.cluster doesn’t actually remove or change any sequence; it just “preclusters” each sequence with a unique sequence, so that subsequent processing treats it as 100% identical to the “merged” unique sequence. If you go back and look at the DNA sequences in the fasta file, though, they are still different.

The reason I ask is that after looking at the DNA sequences of some sequences from the same OTU (~100 bp, called at 97%), I see some pairs of sequences that have 4-5 mismatches. I think this is most likely due to how the pre.cluster step was carried out, but I just wanted to confirm that my understanding of the step is correct.

FYI, I set diffs=1 for pre.cluster.

Thank you so much!

If you are using diffs=1, then the furthest any two sequences should be from each other within a pre.cluster group is 2 bp, since each one can be at most 1 bp from the sequence it was merged with. The clusters resulting from pre.cluster are not the same as OTUs, so it is reasonable to expect more variation within an OTU. Recall that the default clustering method is the average neighbor algorithm, which requires the sequences within an OTU to be on average at most 3% different from each other. For ~100 bp sequences that is about 3 differences on average, so it is reasonable to see 4-5 differences between some pairs of sequences, although those would be expected to be rare.
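
To make the arithmetic concrete, here is a minimal Python sketch. It is not mothur code and does not reproduce mothur's actual pre.cluster or average neighbor implementations. The 100 bp sequences, the `mutate` and `hamming` helpers, and the use of the mean pairwise difference as a stand-in for the "on average at most 3%" criterion are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical substitution table: each base is swapped for a different base.
SWAP = {"A": "C", "C": "A", "G": "T", "T": "G"}


def mutate(seq, positions):
    """Return a copy of seq with a substitution at each given position."""
    chars = list(seq)
    for p in positions:
        chars[p] = SWAP[chars[p]]
    return "".join(chars)


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Made-up 100 bp representative sequence (not real data).
rep = "ACGT" * 25

# Part 1: with diffs=1, two sequences can each sit 1 bp from the
# representative at different positions, so they differ from each other by 2 bp.
seq_a = mutate(rep, [10])
seq_b = mutate(rep, [20])
precluster_spread = hamming(seq_a, seq_b)

# Part 2: a toy "OTU" of four sequences. Assumption: the mean pairwise
# difference is used here as a simplified reading of the average criterion.
otu = [
    rep,
    mutate(rep, [0, 1]),        # 2 bp from rep
    mutate(rep, [97, 98, 99]),  # 3 bp from rep
    mutate(rep, [50]),          # 1 bp from rep
]
pair_diffs = [hamming(a, b) for a, b in combinations(otu, 2)]
max_diff = max(pair_diffs)
mean_pct = 100 * sum(pair_diffs) / (len(pair_diffs) * len(rep))

print("max difference within pre.cluster group:", precluster_spread)
print("pairwise differences in toy OTU:", sorted(pair_diffs))
print("largest pairwise difference:", max_diff)
print("mean pairwise difference (%):", mean_pct)
```

In this toy case the two preclustered sequences differ by 2 bp, while the toy OTU has a mean pairwise difference of 3.0% even though one pair differs by 5 bp. That matches the point above: an average-based cutoff bounds the average, not every individual pair.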

Got it! Thanks so much Pat!