Cluster "X% difference" misleading?


I’m currently considering clustering using mothur, but am confused by your indication of % difference. From looking at the code, it does not appear that there is any collapsing at percentage difference levels, but rather just clustering using the common complete linkage (furthest neighbor) algorithm and the distances from the input matrix. Is this correct? Or, is there processing of the resulting cluster after the fact that I’m missing that collapses OTUs based on % difference?

I’m not using mothur to align or calculate distances, but am instead inputting a phylip-formatted matrix I’m creating manually using another distance metric that is not based on the % difference that dist.seqs will give.
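For readers who want to build such a matrix themselves, here is a minimal Python sketch of writing a lower-triangle, phylip-style distance file. The function name `write_lower_triangle`, the tab separator, and the four-decimal formatting are assumptions for illustration; check the layout against the mothur documentation before relying on it.

```python
import io

def write_lower_triangle(names, distance, out):
    """Write a lower-triangle, phylip-style distance matrix to `out`.

    names:    list of sequence names, in row order
    distance: hypothetical callable (name_a, name_b) -> float; any metric
    out:      writable text stream
    """
    # First line: the number of sequences.
    out.write(f"{len(names)}\n")
    for i, name in enumerate(names):
        # Row i holds the distances to the i sequences listed before it,
        # so the first row carries only a name.
        row = [f"{distance(name, names[j]):.4f}" for j in range(i)]
        out.write("\t".join([name] + row) + "\n")

# Toy example with made-up distances.
toy = {
    frozenset(("seqA", "seqB")): 0.02,
    frozenset(("seqA", "seqC")): 0.054,
    frozenset(("seqB", "seqC")): 0.04,
}
buffer = io.StringIO()
write_lower_triangle(["seqA", "seqB", "seqC"],
                     lambda a, b: toy[frozenset((a, b))],
                     buffer)
matrix_text = buffer.getvalue()
```

The clustering step only ever sees the numbers in this file, so whatever metric produced them determines what a cutoff such as 0.05 means.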


I’m not sure that I follow… The distances that mothur gives are cutoffs, which indicate - in the case of furthest neighbor/complete linkage - that the largest distance between any two sequences in an OTU is that cutoff. How is this misleading?

On this page, you state that for complete linkage clustering, “All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.” It sounds like you’re equating X% distant with distance between reads/nodes when the reads are clustered. But this is not the case, I don’t think. I don’t think that sequences that are 5% distant have a distance of 0.05, necessarily.

So, I’m trying to get at how OTUs are created once the reads are clustered. You provide cutoff and precision parameters for the clustering process, which are straightforward, but what about what is done once the clustering procedure is complete? Are OTUs created during the clustering (which I’m treating as simply the joining of reads into more and more distant nodes), or are they created after the clustering process is complete (when the final node on the tree contains all reads in the set)?

It could be possible to traverse the final tree once the clustering is complete in order to determine %difference as the cluster page suggests, by visiting all nodes in the tree. Is this what’s happening?

Hopefully, the question is more clear now. I’ll keep digging around the code to see if I can get any more information.

Thanks again for your help.

Sorry, I’m dense. I still don’t get it. All of the sequences within an OTU at the 0.05 level are at most 5% different from each other.

I’m not sure what you’re saying, but…
…this says nothing about how similar one OTU is to the next.
…the distinction between distance and difference is going to be made by how the user defines and calculates the values.

What the algorithm does is this…

  1. Search for the minimum distance in the matrix and join the OTUs/sequences represented by the distance
  2. Merge the OTUs/sequences by some rule as defined by NN, AN, or FN
  3. Repeat

As the minimum distance in the matrix increases, we look for it to cross thresholds defined by the precision level. Once you reach 0.0549 and you know that each OTU has X, Y, Z, etc. number of sequences, those data are pumped out for a distance level of 0.05 and a precision of 100.
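The steps and the threshold reporting described above can be sketched in Python as follows. This is a naive illustration for the furthest neighbor (FN) rule only, not mothur’s implementation: the names `furthest_neighbor`, `to_units`, and `otus_at_label` are hypothetical, the toy distances are made up, and round-half-up binning is an assumed reading of "0.0549 is reported at the 0.05 level with a precision of 100".

```python
import math
from itertools import combinations

def to_units(d, precision=100):
    # Assumed binning: round half up to the nearest 1/precision.
    return int(math.floor(d * precision + 0.5))

def furthest_neighbor(names, dist, cutoff):
    """Naive complete-linkage clustering; stops once the next merge exceeds cutoff.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    Returns a list of (merge_distance, clusters_after_merge).
    """
    clusters = [frozenset([n]) for n in names]

    def linkage(a, b):
        # FN rule: distance between two OTUs is their largest pairwise distance.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    history = []
    while len(clusters) > 1:
        # Step 1: find the minimum distance among the current OTUs.
        d, a, b = min(((linkage(a, b), a, b)
                       for a, b in combinations(clusters, 2)),
                      key=lambda t: t[0])
        if d > cutoff:
            break
        # Step 2: merge the two OTUs; step 3: repeat.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((d, list(clusters)))
    return history

def otus_at_label(names, history, label, precision=100):
    """OTUs reported at `label`: apply every merge whose binned distance <= label."""
    clusters = [frozenset([n]) for n in names]
    for d, snapshot in history:
        if to_units(d, precision) > to_units(label, precision):
            break
        clusters = snapshot
    return set(clusters)

NAMES = ["A", "B", "C", "D"]
DIST = {frozenset(pair): d for pair, d in [
    (("A", "B"), 0.02), (("A", "C"), 0.054), (("B", "C"), 0.04),
    (("A", "D"), 0.30), (("B", "D"), 0.31), (("C", "D"), 0.29),
]}
HISTORY = furthest_neighbor(NAMES, DIST, cutoff=0.10)
```

In this toy data A, B, and C end up in one OTU at the 0.05 label because their largest pairwise distance, 0.054, bins to 0.05. OTUs fall out of the merge sequence itself; no separate tree traversal is needed afterward.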

Thanks for responding again. Maybe I’m confusing the issue. Here’s an example:

I have the lower triangle of a Phylip-formatted distance matrix that I’m computing manually from global, pairwise distances that are not based on some % difference between sequences (as in dist.seqs) from a set of approximately 39K unique reads across 16 samples. I read this matrix into mothur, run hcluster, and get (among many other lines) the following line:

0.05 13 17766 4885 829 195 89 43 19 18 95 5 1 1

So here’s how I’m interpreting these results. Please correct me where I’m wrong.

At the 0.05 distance cutoff:

  • There are 13 reads in the largest OTU
  • There are 17766 OTUs with only 1 read
  • There are 4885 OTUs with 2 reads
  • There are 829 OTUs with 3 reads
  • There is only 1 OTU with 11 reads
  • There is only 1 OTU with 12 reads

In total, at 0.05, there are 13 OTUs. Also, at the 0.05 distance cutoff, are you saying that all 13 of these OTUs only differ by 5%? Or are you saying that within each OTU category, the reads differ by only 5%? Or both? I’m having a hard time wrapping my brain around this (obviously ;-)). It seems like all of the OTU categories would differ by a distance of 0.05, which is not guaranteed to be the same thing as difference of 5%.

So, what do you think? Am I totally crazy?


Ok, we’re getting there…

0.05 13 17766 4885 829 195 89 43 19 18 95 5 1 1

Your interpretation is mostly correct (except that you have 1 12-ton and 1 13-ton) until you get to the total number of OTUs. It isn’t 13; rather, it is 17766+4885+829+…+1+1, or 23,946 total OTUs. Within each of these 23,946 OTUs, the largest distance between any two sequences is 0.05. We don’t know how different OTU1 is from OTU2, but that is an interesting question for another day.
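The arithmetic can be checked with a short Python sketch. The function name `summarize_line` is hypothetical, and the field layout (label, largest OTU size, then one count per OTU size) is taken from the interpretation given in this thread. Note that the line as pasted carries 12 count fields against a stated largest size of 13, so the sketch only totals the counts and does not try to assign a size to each one.

```python
def summarize_line(line):
    """Split one output line into its label, largest OTU size, and OTU counts."""
    fields = line.split()
    label = fields[0]                        # distance level, e.g. "0.05"
    largest = int(fields[1])                 # number of reads in the largest OTU
    counts = [int(x) for x in fields[2:]]    # number of OTUs at each size
    return label, largest, counts, sum(counts)

label, largest, counts, total_otus = summarize_line(
    "0.05 13 17766 4885 829 195 89 43 19 18 95 5 1 1"
)
```

Here `total_otus` is the sum of the count fields, which is the quantity the reply above calls the total number of OTUs.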

Hope this helps,

OK, thanks. Back to the distance vs. % difference question. If we’re using a distance metric that is not based on uncorrected pairwise distances like those calculated from dist.seqs are, it does not seem possible to use the 0.03, 0.05, 0.10 as reasonable OTU cutoffs, assuming that these distances/percent differences relate to some taxonomic level. If you look at the logdet/paralinear distance, for example, the range of distances could be [0,Infinity] making % impossible. However, under those calculated by dist.seqs, the range of distances is bounded [0,1] making %s much more interpretable. What do you think? We have some ideas on how to relate % difference to these corrected distances, but want to make sure that this makes sense with what mothur is doing, first.


we (i.e. the field) are generally pretty sloppy about distance/difference. reviewers ask for units on a distance and we give them %. you’re right that corrected distances can be larger than 1, which is obviously awkward. with well-aligned 16S sequences this doesn’t happen even with corrected distances. the max distance that dist.seqs can calculate is 1. i’d also say that any true distance that corrects for multiple substitutions will have to have 5 characters to include gaps - this is rarely, if ever, done. it seems like people only use a multiple substitution correction out of some sense of phylogenetic guilt.
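To make the bounded versus unbounded point concrete, here is a small Python sketch using the Jukes-Cantor correction. This is a swapped-in illustration: the thread mentions logdet/paralinear distances, not Jukes-Cantor, and the function name `jukes_cantor` is hypothetical. The formula d = -(3/4) ln(1 - 4p/3) maps an uncorrected proportion of differing sites p in [0, 0.75) to a corrected distance that grows without bound.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an uncorrected proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        # The correction is undefined once p reaches the saturation point 3/4.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

small = jukes_cantor(0.05)   # close to the uncorrected value for similar sequences
large = jukes_cantor(0.70)   # exceeds 1 for highly divergent sequences
```

For small p the corrected and uncorrected values nearly coincide, which fits the remark that well-aligned 16S sequences rarely run into the problem; for divergent pairs the corrected value passes 1 and can no longer be read as a percentage.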