Dist.Seqs/Cluster and Non-Overlapping Seqs

I have a dataset of (mostly) full-length environmental 16S sequences that I’m trying to cluster into OTUs; however, a decent chunk of the reads (~30%) are long partial sequences. In the worst case, I could have two non-overlapping reads from the same OTU, covering different ends of the 16S gene, that I’d like to cluster together. Obviously I can’t cluster them without any mutual information - that is, without at least one read that overlaps both - but if I do have such a read, how do I make it at least possible for them to end up in the same cluster?

Will Dist.Seqs give them a distance of 1, i.e. complete divergence?

If I understand Cluster correctly, two reads with a distance of 1 won’t cluster under default conditions (method=furthest) and are extremely unlikely to cluster with method=average unless the sample size is very large, but they might with method=nearest. Is that correct?
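
To make the question concrete, here is a minimal toy sketch of the three linkage rules in Python. It is not mothur's implementation: the read names, the distances, and the 0.03 cutoff are all hypothetical, and it assumes the two non-overlapping reads (A and B) are assigned a distance of 1 while a third read (C) overlaps both and is nearly identical to each.

```python
from itertools import combinations

# Hypothetical pairwise distances: A and B do not overlap (distance 1.0),
# while C overlaps both and is nearly identical to each.
DIST = {
    frozenset(("A", "B")): 1.00,
    frozenset(("A", "C")): 0.01,
    frozenset(("B", "C")): 0.01,
}

def linkage(c1, c2, method):
    """Distance between two clusters under the given linkage rule."""
    ds = [DIST[frozenset((x, y))] for x in c1 for y in c2]
    if method == "nearest":
        return min(ds)
    if method == "furthest":
        return max(ds)
    if method == "average":
        return sum(ds) / len(ds)
    raise ValueError(method)

def cluster(names, method, cutoff):
    """Greedy agglomerative clustering: merge the closest pair of clusters
    until no pair is within the cutoff."""
    clusters = [frozenset([n]) for n in names]
    while len(clusters) > 1:
        d, a, b = min(
            ((linkage(a, b, method), a, b) for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if d > cutoff:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)

for m in ("nearest", "average", "furthest"):
    print(m, cluster(["A", "B", "C"], m, cutoff=0.03))
```

In this toy case only nearest joins all three reads: once A and C merge, the single A-B distance of 1 pushes the average (0.505) and the maximum (1.0) far above the cutoff, so B stays on its own under the other two rules.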


That’s correct - but I wouldn’t trust even nearest, since the two reads share no mutual information and nearest is a pretty poor method in general. I would use screen.seqs and filter.seqs so that only the bases that overlap across your reads are considered.
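
As a rough illustration of what "only the bases that overlap" means, here is a small Python sketch that trims a toy alignment to the columns every sequence spans. It is an assumption-laden toy and not what screen.seqs or filter.seqs do internally: the sequence names and the alignment are made up, and `.` or `-` is assumed to mark columns a read does not cover.

```python
def covered_span(aligned):
    """Return (start, end) of the first and last non-gap column, end exclusive."""
    cols = [i for i, ch in enumerate(aligned) if ch not in ".-"]
    return cols[0], cols[-1] + 1

def trim_to_shared(seqs):
    """Trim every aligned sequence to the columns that all of them span."""
    spans = [covered_span(s) for s in seqs.values()]
    start = max(s for s, _ in spans)
    end = min(e for _, e in spans)
    if start >= end:
        raise ValueError("no column is spanned by every sequence")
    return {name: s[start:end] for name, s in seqs.items()}

# Hypothetical alignment: one full-length read and two partial reads.
alignment = {
    "full":  "ACGTACGTACGT",
    "left":  "ACGTACGT....",
    "right": "....ACGTACGT",
}
print(trim_to_shared(alignment))
```

If two reads do not overlap at all, the sketch raises an error instead of trimming, which mirrors the point above: some reads have to be dropped so that the remaining ones share a common region before distances are calculated.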