I have a data set of (mostly) full-length environmental 16S sequences that I’m trying to cluster into OTUs, but a decent chunk of the reads (~30%) are long partial sequences. In the worst case, I could have two non-overlapping reads from the same OTU, covering opposite ends of the 16S gene, that I’d like to cluster together. Obviously I can’t cluster them without any mutual information - that is, without at least one read that overlaps both - but if I do have such a read, how do I make it at least possible for them to end up in the same cluster?
Will Dist.Seqs give them a distance of 1, i.e. complete divergence?
If I understand Cluster correctly, two reads with a distance of 1 won’t cluster under the default method (method=furthest) and are extremely unlikely to cluster with method=average without very large sample sizes, but they might with method=nearest. Is that correct?
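To make the question concrete, here is a toy sketch in plain Python of what I think happens. It is not mothur's code: the `cluster` function is a hypothetical, naive agglomerative clusterer, the 0.03 cutoff is just an illustrative value, and the distance of 1.0 between the two non-overlapping reads is exactly the assumption I'm asking about. Reads A and B cover opposite ends of the gene, and C is a full-length read that overlaps both.

```python
from itertools import combinations

def cluster(dist, names, cutoff, method):
    """Naive agglomerative clustering (illustrative only, not mothur's algorithm).

    Repeatedly merges the closest pair of clusters and stops once the
    closest pair is further apart than `cutoff`.
    """
    link = {
        "nearest": min,                         # single linkage
        "furthest": max,                        # complete linkage
        "average": lambda v: sum(v) / len(v),   # mean of all pairwise distances
    }[method]
    clusters = [frozenset([n]) for n in names]

    def d(a, b):
        # Distance between two clusters under the chosen linkage rule.
        return link([dist[frozenset((x, y))] for x in a for y in b])

    while len(clusters) > 1:
        # Ties are broken by enumeration order, which is arbitrary.
        a, b = min(combinations(clusters, 2), key=lambda p: d(*p))
        if d(a, b) > cutoff:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)

# Assumed toy distances: A and B do not overlap (assumed distance 1.0),
# while the full-length read C is nearly identical to each of them.
dist = {
    frozenset(("A", "C")): 0.01,
    frozenset(("B", "C")): 0.01,
    frozenset(("A", "B")): 1.0,
}
names = ["A", "B", "C"]

for method in ("nearest", "furthest", "average"):
    print(method, cluster(dist, names, cutoff=0.03, method=method))
# nearest  [['A', 'B', 'C']]
# furthest [['A', 'C'], ['B']]
# average  [['A', 'C'], ['B']]
```

In this toy case only nearest-neighbour linkage lets C bridge A and B. With furthest, the A-B distance of 1.0 blocks the merge outright, and with average the single bridging read only brings the mean down to 0.505, which is why I assume it would take a lot of overlapping reads to get under the cutoff.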