Talmudic question #1

I am hoping that mothur users can weigh in on a question that we’ve previously ignored but that seems to be getting raised more and more these days: which clustering algorithm is best to use? Way back when, I proposed using furthest neighbor (fn) because it gave the most conservative estimate of how well one has sampled a community. Another advantage of fn is that one knows what an OTU represents - all of the sequences within an OTU are within the cutoff of each other. The problem with this approach is that it is fairly stringent and has the propensity to leave out sequences that are similar to many of the sequences within the OTU, but not all. This is in contrast to the nearest neighbor (nn) algorithm, where a sequence can join an OTU if it’s within the cutoff of at least one member of the OTU. So in these OTUs, sequences can be much further apart from each other than the cutoff, but you know that no one was left out. Finally, we have the average neighbor (an) algorithm, which is basically a hybrid approach. Up until now, this has been the basis for the argument, which may or may not be very strong. A couple of recent events are causing me to question whether this is the best logic…
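To make the distinction concrete, here is a minimal sketch (my own toy Python, not mothur’s actual implementation) of the membership test each criterion applies when deciding whether a candidate sequence may join an existing OTU:

```python
# Sketch of the three linkage criteria. `dists` holds the pairwise
# distances between a candidate sequence and each sequence already in
# the OTU; `cutoff` is the OTU threshold (e.g. 0.03).

def joins_otu(dists, cutoff, method):
    if method == "nn":   # nearest neighbor: within cutoff of at least one member
        return min(dists) <= cutoff
    if method == "fn":   # furthest neighbor: within cutoff of every member
        return max(dists) <= cutoff
    if method == "an":   # average neighbor: within cutoff of members on average
        return sum(dists) / len(dists) <= cutoff
    raise ValueError(method)

# A candidate that is close to some members of an OTU but not all:
dists = [0.01, 0.02, 0.05]
print(joins_otu(dists, 0.03, "nn"))  # True
print(joins_otu(dists, 0.03, "fn"))  # False
print(joins_otu(dists, 0.03, "an"))  # True (mean distance = 0.0267)
```

The same candidate joins under nn, is rejected under fn, and joins under an - which is exactly why the three algorithms produce different OTU counts from the same distance matrix.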

First, there is a paper coming out soon in Environmental Microbiology where Huse and colleagues argue for AN because when they use it (with a preclustering step) they are able to get OTU counts for a mock community that make more sense. In other words, fn seems to inflate the number of OTUs. Then again, this also seems like over-fitting the method to the data.

Second, at the ICoMM meeting I was at, Anders Anderson suggested a method of quantifying the validity of the output from different clustering algorithms. Basically, you can think of clustering as an attempt to get similar sequences into the correct OTU. So if the pairwise distance between two sequences is below a threshold, you consider that pair a positive (i.e. they belong in the same OTU). If it is above the threshold, it is a negative (i.e. they belong in separate OTUs). Then you can go through a list file and ask how many positives end up in the same OTU (i.e. true positives) or different OTUs (i.e. false negatives). You can also ask how many negatives end up in different OTUs (i.e. true negatives) or the same OTU (i.e. false positives).
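As a sketch of the bookkeeping (my own toy Python, not Anders’ code or a mothur command): enumerate all pairs, compare each pairwise distance to the cutoff, and check whether the pair landed in the same OTU:

```python
from itertools import combinations

def confusion_counts(dist, otu, cutoff):
    """Count pairwise TP/FP/TN/FN for a clustering.
    dist[(a, b)]: pairwise distance, keyed with a < b;
    otu[a]: OTU label assigned to sequence a."""
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(otu), 2):
        same_otu = otu[a] == otu[b]
        below = dist[(a, b)] <= cutoff
        if below and same_otu:       tp += 1   # correctly together
        elif below and not same_otu: fn += 1   # split apart, belong together
        elif not below and same_otu: fp += 1   # lumped, belong apart
        else:                        tn += 1   # correctly apart
    return tp, fp, tn, fn

# Toy example: three sequences, all lumped into one OTU (nn-style),
# even though A and C are further apart than the cutoff.
dist = {("A", "B"): 0.02, ("A", "C"): 0.05, ("B", "C"): 0.02}
otu = {"A": 1, "B": 1, "C": 1}
print(confusion_counts(dist, otu, 0.03))  # (2, 1, 0, 0)
```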

So I have taken this idea and applied it to a collection of 13,501 full-length sequences. Below are the counts I found for the three algorithms when clustering sequences into OTUs at a 97% similarity cutoff (0.03)…

Nearest Neighbor
TP = 176341
FP = 257585
TN = 90697824
FN = 0

Average Neighbor
TP = 140736
FP = 13993
TN = 90941416
FN = 35605

Furthest Neighbor
TP = 70873
FP = 0
TN = 90955409
FN = 105468

If you calculate the F1 score (http://en.wikipedia.org/wiki/F1_score) for each of these you get 0.57, 0.85, and 0.57, respectively.
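For reference, the F1 score is the harmonic mean of precision and recall, so the true-negative count drops out entirely; plugging in the counts above (a quick Python check):

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of co-clustered pairs that belong together
    recall = tp / (tp + fn)      # fraction of belonging-together pairs co-clustered
    return 2 * precision * recall / (precision + recall)

# Counts from the post above; TN never enters the F1 score.
print(f1(176341, 257585, 0))       # nearest neighbor,  ~0.578
print(f1(140736, 13993, 35605))    # average neighbor,  ~0.850
print(f1(70873, 0, 105468))        # furthest neighbor, ~0.573
```

Note how the two extremes fail in opposite ways: nn has perfect recall (FN = 0) but poor precision, fn has perfect precision (FP = 0) but poor recall, and both land at roughly the same F1.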

So there are three questions…

  1. What is more problematic: false positives or false negatives?
  2. How important is knowing what an OTU really represents?
  3. Which algorithm do you think is best?

I favor AN for the same reasons as Huse. I saw a Sogin study which said the same thing.

But I use FN in my papers because “that’s what all my friends are doing”, and I want to be comparable.


I think nearest neighbor makes the least sense and doesn’t appear to have many backers. One could imagine (eventually) chaining together the entire 16S sequence continuum, with each sequence differing from the next by 0.03.

Imagine I live on the east end of a city block and I have a friend, Larry, who lives on the extreme west end. We are 0.03 miles apart (a very small block, I know). Now imagine I’m also friends with Lamont, who lives on the next street, just past Larry. On average, Lamont lives just 0.03 miles from all the other residents of my block, but we still consider him to reside in the next block.

But OTUs aren’t city blocks. They are arbitrary surrogates for species and phyla to begin with. So why set a hard-and-fast rule for a soft definition? I think ultimately I would agree with James: since most of us are interested in comparing between samples, our results should be informative so long as we are consistent in applying one clustering method. But in the interest of comparing between publications (always a bit dicey anyway), perhaps the standard should be to provide both sets of numbers and discuss at least one.


I suppose the more significant issue for me is “why don’t we use evolutionary distance corrections”? With long-diverged species, we really ought to be using evolutionary models for correction, such as HKY.

I’d say that we don’t use HKY because it’d be next to impossible to fit, and, as far as I know, HKY gags on gaps (or just ignores them).
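For intuition, the simplest evolutionary correction, Jukes-Cantor, already shows the idea (HKY additionally needs base frequencies and a transition/transversion ratio, which is part of what makes it hard to fit). The toy Python below simply skips gapped columns, which is one crude answer to the gap problem:

```python
import math

def jukes_cantor(p_observed):
    """Jukes-Cantor correction: convert an observed proportion of
    differing sites into an estimated evolutionary distance.
    Undefined for p_observed >= 0.75 (saturation)."""
    return -0.75 * math.log(1 - (4.0 / 3.0) * p_observed)

def observed_distance(seq1, seq2):
    """Proportion of mismatched sites, ignoring any column where
    either sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

p = observed_distance("ACGTACGTAC", "ACGTACGTTT")
print(p)                              # 0.2 observed
print(round(jukes_cantor(p), 4))      # 0.2326 corrected
```

The corrected distance is always larger than the observed one, because multiple substitutions at the same site hide some of the divergence - the effect grows quickly for long-diverged sequences.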

I must be missing something, because a common clustering method in ecology is Ward’s, and it tries to reduce the within-cluster variance (the sum of squared error) by grouping like numbers.
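For what it’s worth, Ward’s criterion merges, at each step, the pair of clusters whose union gives the smallest increase in the within-cluster sum of squared error. A one-dimensional Python sketch of that merge cost (the standard centroid form, not anything from mothur):

```python
def ward_merge_cost(a, b):
    """Increase in within-cluster sum of squared error from merging
    clusters a and b: |a|*|b|/(|a|+|b|) * (centroid_a - centroid_b)^2,
    for one-dimensional data."""
    ca = sum(a) / len(a)
    cb = sum(b) / len(b)
    return (len(a) * len(b)) / (len(a) + len(b)) * (ca - cb) ** 2

# Ward prefers merging the pair of clusters with the smallest cost:
print(ward_merge_cost([1.0, 1.2], [1.1, 1.3]))   # small: similar clusters
print(ward_merge_cost([1.0, 1.2], [5.0, 5.2]))   # large: distant clusters
```

At each agglomeration step, every candidate pair is scored this way and the cheapest merge is taken - which is also why I suspect it gets expensive on tens of thousands of sequences.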

Maybe it should be tried in mothur?

Thanks - it’s an idea, but I think the algorithmic complexity might be prohibitive.