Talmudic question #1

I am hoping that mothur users can weigh in on a question that we’ve previously ignored but that seems to be getting raised more and more these days: which clustering algorithm is best to use? Way back when, I proposed using furthest neighbor (fn) because it gave the most conservative estimate of how well one has sampled a community. Another advantage of fn is that one knows what an OTU represents - all of the sequences within an OTU are within the cutoff of each other. The problem with this approach is that it is fairly stringent and has the propensity to leave out sequences that are similar to many of the sequences within the OTU, but not all. This is in contrast to the nearest neighbor (nn) algorithm, where a sequence can join an OTU if it’s within the cutoff of at least one member of the OTU. So in these OTUs, sequences can be much further apart from each other than the cutoff, but you know that no one was left out. Finally, we have the average neighbor (an) algorithm, which is basically a hybrid approach. Up until now, this has been the basis for the argument, which may or may not be very strong. A couple of recent events are causing me to question whether this is the best logic…
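To make the distinction concrete, here is a minimal sketch (my own toy Python, not mothur’s actual implementation) of the membership test each criterion applies when deciding whether a candidate sequence may join an existing OTU:

```python
# Sketch of the three linkage criteria. `dists` holds the pairwise
# distances between a candidate sequence and each sequence already in
# the OTU; `cutoff` is the OTU threshold (e.g. 0.03).

def joins_otu(dists, cutoff, method):
    if method == "nn":   # nearest neighbor: within cutoff of at least one member
        return min(dists) <= cutoff
    if method == "fn":   # furthest neighbor: within cutoff of every member
        return max(dists) <= cutoff
    if method == "an":   # average neighbor: within cutoff of members on average
        return sum(dists) / len(dists) <= cutoff
    raise ValueError(method)

# A candidate that is close to some members of an OTU but not all:
dists = [0.01, 0.02, 0.05]
print(joins_otu(dists, 0.03, "nn"))  # True
print(joins_otu(dists, 0.03, "fn"))  # False
print(joins_otu(dists, 0.03, "an"))  # True (mean distance = 0.0267)
```

The same candidate joins under nn, is rejected under fn, and joins under an - which is exactly why the three algorithms produce different OTU counts from the same distance matrix.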

First, there is a paper coming out soon in Environmental Microbiology where Huse and colleagues argue for AN because when they use it (with a preclustering step) they are able to get OTU counts for a mock community that make more sense. In other words, fn seems to inflate the number of OTUs. Then again, this also seems like over-fitting the method to the data.

Second, at the ICoMM meeting I was at, Anders Anderson suggested a method of quantifying the validity of the output from different clustering algorithms. Basically, you can think of clustering as an attempt to get similar sequences into the correct OTU. So if the pairwise distance between two sequences is below a threshold, you consider that pair a positive (i.e. they belong in the same OTU). If it is above the threshold, it is a negative (i.e. they belong in separate OTUs). Then you can go through a list file and ask how many positives end up in the same OTU (i.e. true positives) or different OTUs (i.e. false negatives). You can also ask how many negatives end up in different OTUs (i.e. true negatives) or the same OTU (i.e. false positives).
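As a sketch of the bookkeeping (my own toy Python, not Anders’ code or a mothur command): enumerate all pairs, compare each pairwise distance to the cutoff, and check whether the pair landed in the same OTU:

```python
from itertools import combinations

def confusion_counts(dist, otu, cutoff):
    """Count pairwise TP/FP/TN/FN for a clustering.
    dist[(a, b)]: pairwise distance, keyed with a < b;
    otu[a]: OTU label assigned to sequence a."""
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(otu), 2):
        same_otu = otu[a] == otu[b]
        below = dist[(a, b)] <= cutoff
        if below and same_otu:       tp += 1   # correctly together
        elif below and not same_otu: fn += 1   # split apart, belong together
        elif not below and same_otu: fp += 1   # lumped, belong apart
        else:                        tn += 1   # correctly apart
    return tp, fp, tn, fn

# Toy example: three sequences, all lumped into one OTU (nn-style),
# even though A and C are further apart than the cutoff.
dist = {("A", "B"): 0.02, ("A", "C"): 0.05, ("B", "C"): 0.02}
otu = {"A": 1, "B": 1, "C": 1}
print(confusion_counts(dist, otu, 0.03))  # (2, 1, 0, 0)
```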

So I have taken this idea and applied it to a collection of 13,501 full-length sequences. Below are the counts I found for the three algorithms when clustering sequences into OTUs at a 97% similarity cutoff (0.03)…

Nearest Neighbor
TP = 176341
FP = 257585
TN = 90697824
FN = 0

Average Neighbor
TP = 140736
FP = 13993
TN = 90941416
FN = 35605

Furthest Neighbor
TP = 70873
FP = 0
TN = 90955409
FN = 105468

If you calculate the F1 score (http://en.wikipedia.org/wiki/F1_score) for each of these you get 0.57, 0.85, and 0.57, respectively.
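For reference, the F1 score is the harmonic mean of precision and recall, so the true-negative count drops out entirely; plugging in the counts above (a quick Python check):

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of co-clustered pairs that belong together
    recall = tp / (tp + fn)      # fraction of belonging-together pairs co-clustered
    return 2 * precision * recall / (precision + recall)

# Counts from the post above; TN never enters the F1 score.
print(f1(176341, 257585, 0))       # nearest neighbor,  ~0.578
print(f1(140736, 13993, 35605))    # average neighbor,  ~0.850
print(f1(70873, 0, 105468))        # furthest neighbor, ~0.573
```

Note how the two extremes fail in opposite ways: nn has perfect recall (FN = 0) but poor precision, fn has perfect precision (FP = 0) but poor recall, and both land at roughly the same F1.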

So there are three questions…

  1. What is more problematic: false positives or false negatives?
  2. How important is knowing what an OTU really represents?
  3. Which algorithm do you think is best?

I favor AN for the same reasons as Huse. I saw a Sogin study which said the same thing.

But I use FN in my papers because “that’s what all my friends are doing”, and I want to be comparable.


I think nearest neighbor makes the least sense and doesn’t appear to have many backers. One could imagine (eventually) chaining together the entire 16S sequence continuum, with each sequence differing from the next by 0.03.

Imagine I live on the east end of a city block and I have a friend, Larry, who lives on the extreme west end. We are 0.03 miles apart (a very small block, I know). Now imagine I’m also friends with Lamont, who lives on the next street, just past Larry. On average, Lamont lives just 0.03 miles from all the other residents of my block, but we still consider him to reside in the next block.

But OTUs aren’t city blocks. They are arbitrary surrogates for species and phyla to begin with. So why set a hard-and-fast rule for a soft definition? I think ultimately I would agree with James: since most of us are interested in comparing between samples, our results should be informative so long as we are consistent in applying one clustering method. But in the interest of comparing between publications (always a bit dicey anyway), perhaps the standard should be to provide both sets of numbers and discuss at least one.


I suppose the more significant issue for me is “why don’t we use evolutionary distance corrections”? With long-diverged species, we really ought to be using evolutionary models for correction, such as HKY.

I’d say that we don’t use HKY because it’d be next to impossible to fit, and, as far as I know, HKY gags on gaps (or just ignores them).
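For intuition, the simplest evolutionary correction, Jukes-Cantor, already shows the idea (HKY additionally needs base frequencies and a transition/transversion ratio, which is part of what makes it hard to fit). The toy Python below simply skips gapped columns, which is one crude answer to the gap problem:

```python
import math

def jukes_cantor(p_observed):
    """Jukes-Cantor correction: convert an observed proportion of
    differing sites into an estimated evolutionary distance.
    Undefined for p_observed >= 0.75 (saturation)."""
    return -0.75 * math.log(1 - (4.0 / 3.0) * p_observed)

def observed_distance(seq1, seq2):
    """Proportion of mismatched sites, ignoring any column where
    either sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

p = observed_distance("ACGTACGTAC", "ACGTACGTTT")
print(p)                              # 0.2 observed
print(round(jukes_cantor(p), 4))      # 0.2326 corrected
```

The corrected distance is always larger than the observed one, because multiple substitutions at the same site hide some of the divergence - the effect grows quickly for long-diverged sequences.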

I must be missing something, because a common clustering method in ecology is Ward’s, and it tries to reduce the within-cluster variance (the sum of squared error) by grouping like numbers.
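For what it’s worth, Ward’s criterion merges, at each step, the pair of clusters whose union gives the smallest increase in the within-cluster sum of squared error. A one-dimensional Python sketch of that merge cost (the standard centroid form, not anything from mothur):

```python
def ward_merge_cost(a, b):
    """Increase in within-cluster sum of squared error from merging
    clusters a and b: |a|*|b|/(|a|+|b|) * (centroid_a - centroid_b)^2,
    for one-dimensional data."""
    ca = sum(a) / len(a)
    cb = sum(b) / len(b)
    return (len(a) * len(b)) / (len(a) + len(b)) * (ca - cb) ** 2

# Ward prefers merging the pair of clusters with the smallest cost:
print(ward_merge_cost([1.0, 1.2], [1.1, 1.3]))   # small: similar clusters
print(ward_merge_cost([1.0, 1.2], [5.0, 5.2]))   # large: distant clusters
```

At each agglomeration step, every candidate pair is scored this way and the cheapest merge is taken - which is also why I suspect it gets expensive on tens of thousands of sequences.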

Maybe it should be tried in mothur?

Thanks - it’s an idea, but I think the algorithmic complexity might be prohibitive.