wrong clustering


I used mothur to cluster my data and to extract representative sequences for distance=0.05.
After blast search on ncbi website i found that some representative sequences have same best homolog.
I aligned my representative sequences and calculated pairwise distances between them.
I found that many representative sequence pairs have distance far below 0.05. For example:

GCFD4AW02IZARV|45|83 GCFD4AW02G509M|63|20 0.006061
GCFD4AW02IZARV|45|83 GCFD4AW02F322M|40|24 0.006061
GCFD4AW02IZARV|45|83 GCFD4AW02GBM51|52|49 0.006061
GCFD4AW02IZARV|45|83 GCFD4AW02HED3N|58|52 0.006024
GCFD4AW02IZARV|45|83 GCFD4AW02I1DPU|42|59 0.006024
GCFD4AW02IZARV|45|83 GCFD4AW02IYGDP|28|80 0.006061

How is it possible that these sequences are in different OTUs?

Here is the summary for my data after align.seqs, screen.seqs and filter.seqs commands:

mothur > summary.seqs(fasta=1810Ubac.good.filter.fasta)

Start End NBases Ambigs Polymer
Minimum: 1 412 145 0 3
2.5%-tile: 82 412 165 0 3
25%-tile: 82 412 165 0 4
Median: 82 412 190 0 4
75%-tile: 82 412 191 0 5
97.5%-tile: 82 414 193 0 6
Maximum: 82 492 220 1 9

of Seqs: 6817

Then i ran commands:
dist.seqs(fasta=1810Ubac.good.filter.fasta, calc=onegap, countends=F, cutoff=0.10, output=lt, processors=3)
read.dist(phylip=1810Ubac.good.filter.phylip.dist, cutoff=0.10)
get.oturep(phylip=1810Ubac.good.filter.phylip.dist, fasta=1810Ubac.fas, sorted=size, list=1810Ubac.good.filter.phylip.fn.list, label=0.05)

I use mothur v.1.11.0

Thank you.

Clustering is the heart and soul of the mothur program. GOD HELP US ALL if there’s a bug in that code!!!
:evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock: :evil: :o :shock:

Mothur’s default clustering algorithm (which you are using by not specifying an algorithm) is furthest neighbor. That means the BIGGEST distance between any two sequences in the OTU is X. That doesn’t exclude the possibility that any particular sequence in the OTU might be closer than X away from a sequence from a different OTU. Consider this toy example:



type the following magic incantation:


As you can see by examining test.fn.list:

unique 9 a1 a2 a3 b1 b2 b3 c1 c2 c3 
0.20 4 a2,a1 b3,a3 b2,b1 c3,c2,c1 
0.40 1 c3,c2,c1,b2,b1,b3,a3,a2,a1

the clustering looks reasonable. Read the section titled “Variability” in the Mothur manual for the Cluster command if your results look different. Actually, read it even if your results are exactly the same as mine.

When you examine test.fn.0.20.rep.dist:

b3|2|2 a2|1|2 0.4
b2|3|2 a2|1|2 0.2
b2|3|2 b3|2|2 0.2
c3|4|3 a2|1|2 0.4
c3|4|3 b3|2|2 0.2
c3|4|3 b2|3|2 0.4

there are examples of distances AT the 0.2 cutoff BETWEEN representatives of different OTUs! I think them’s the breaks when the number of variant positions in the alignment << the alignment length. Perhaps your sequences are short or they aren’t very diverse. If you were to visualize your sequences in n-dimensional space, they aren’t very “clumpy.” This is both the bad and the good of the OTU approach.

I believe average neigbor is slightly more immune to this problem than furthest neighbor. And nearest neighbor is almost completely immune – which is exactly why I hate my nearest neighbor :x As an exercise try the other algorithms on the toy data as well as your own dataset.


Wow, thanks Robin. Feel free to send us the original distance matrix of representative sequences and we can take a look. We are actually about this close ’ ’ to changing the default to average neighbor because it is actually much better than furthest or nearest at making sure that the right sequences show up in the right OTUs.

Thanks for the answers.
Here is the matrix:
Upload not implemented: rep.filter.dist.zip
Looks like average neighbor clustering is much more reasonable.
I got 153 OTUs with average method, and there are no problem with close representative sequences now.
My result with furthest method is 237 OTUs.


I just figured I would weigh in on the furthest vs average clustering algorithms and describe a little exercise we did to convince ourselves that average neighbor led to the most robust OTUs.
Using a distance matrix constructed from about 4,000 sequences, we ran each clustering algorithm 100 times to create a total of 200 different sets of OTUs.
Now for each algorithm, which has different 100 OTU sets, we first looked at each OTU set individually. For all possible pairs of sequences, we assigned them a 0 or a 1, 0 meaning they are not in the same OTU and 1 meaning they are in the same OTU. Then we did this for all 100 OTUs sets and added the 0s and 1s for each sequence pair. Finally we made as histogram of the value of each sequence pair.
As you can imagine, if the clustering algorithm leads to exactly the same OTUs each time, then this histogram will only have values in the 0 or the 100 columns. However, the true shape was U shaped, with most sequence pairs always (or never) being put together, but a few pairs that jump around more. By looking at the histograms it was clear that average neighbor was more consistent with its OTUs.


Hmm. Interesting experiment. But if you were to repeat it using nearest neighbor, I’d bet the separation would be even cleaner. Does that make it the correct algorithm to use? I don’t think so. What if ambiguity in clustering is the correct answer? My intuition still tells me average neighbor is best (or at least, the least nonsensical of the three). I just don’t think this is the right experiment to prove it.


P.S. I’d like to see median neighbor implemented as well.

For this particular dataset, which is mostly Proteobacteria, nearest neighbor at 0.03 clumps about half of the sequences into one big OTU. Because of this, we didn’t explore it much further, though I seem to remember the separation being better.