cluster.split Changes Cutoff to an Unlikely Similarity Threshold

Hello Mothur Users,

Has anyone noticed that the cluster.split command produces unexpectedly low similarity cutoffs? For example: “Cutoff was 0.255 changed cutoff to 0.06…It took 46525 seconds to cluster.”

For a massive dataset (~200 000 unique sequences), this degree of similarity is suspect, and what's more, I know it is suspect because I had originally run a subset of this dataset (~100 000 unique sequences) independently and the cutoff was automatically set to 0.09 (a much more believable level). I can’t think of a legitimate explanation for how adding additional sequence diversity would reduce the global similarity cutoff.


Please help,

Thanks!

I’m not sure what you mean by this being “suspect”. You might read the FAQ regarding the change in the cutoff…

http://www.mothur.org/wiki/Frequently_asked_questions#Why_does_the_cutoff_change_when_I_cluster_with_average_neighbor.3F

Pat

Thanks for pointing me in the right direction, but you’ll have to forgive my inability to conceive of what is happening here. My main stumbling block seems to be that I interpret the dissimilarity cutoff to mean: “beyond this level of dissimilarity, all sequences will form one group.” So, for instance, if the cutoff is 0.09, no two sequences would be more than 9% different. Is this correct?

If that is more or less correct, isn’t it unusual that the greatest dissimilarity found in a subset of the data would not be preserved after adding more sequences? In other words, if one dataset has a distance cutoff of 0.09, expanding the dataset should not erase the fact that group A and group B are, at best, 0.09 dissimilar. How, in my case, is it possible for groups A and B to now share no more than 0.06 dissimilarity?

Thank you for your patience,

Roli

Hi Pat

I’ve looked at Roli’s process and agree that something is going on that I can’t explain to him either, so I’m rewriting his question to try to lay it out more clearly.

He has one half plate of sequences that he processed according to the 454 SOP, including cluster.split, where the cutoff was reset to 0.09. He then reprocessed those sequences along with two half plates of my sequences (which I’d previously run through the SOP and cluster.split, also getting a 0.09 cutoff) by combining his unique’d/chimera-checked fasta, names, and group files with my unique’d/chimera-checked files. This combined dataset is where he’s getting 0.06 as his furthest clustering distance. I can’t figure out how two halves of a dataset each give 9% clusters but, when combined, give only 6%.

The clustering is dependent on the data you give it, so if you change the data, the output may change slightly. The cutoff changes because we throw away the distances above the cutoff. Since the AN algorithm calculates an average distance when merging OTUs, the cutoff has to be adjusted whenever that average would require a distance that has been thrown away. Put simply, the change in cutoff ensures that the clusters you get now at 0.06 are just as good as those you got at 0.06 before, even though the earlier runs went out to 0.09.
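To make that concrete, here is a rough sketch in Python - not mothur's actual code, and the sequence names and distances below are invented - of what happens when an average-neighbor merge needs a distance that was above the cutoff and was therefore never stored:

cutoff = 0.10

# Sparse matrix: pairs whose distance exceeded the cutoff were never stored.
dists = {
    frozenset(("A", "B")): 0.02,
    frozenset(("A", "C")): 0.08,
    # ("B", "C") was above 0.10, so it is simply absent.
}

def merge_distance(members, other, dists, cutoff):
    # Average distance from a newly merged OTU (members) to another OTU.
    # If a needed pairwise distance was thrown away, the true average can't
    # be computed, so the reported cutoff drops to the largest distance we
    # still have; OTUs are then only reported up to that lowered cutoff.
    needed = [dists.get(frozenset((m, other))) for m in members]
    known = [d for d in needed if d is not None]
    if len(known) < len(needed):          # a distance is missing
        cutoff = min(cutoff, max(known))  # e.g. 0.10 -> 0.08
    return sum(known) / len(known), cutoff

# After A and B merge (they are 0.02 apart), the algorithm wants the average
# of d(A,C) and d(B,C), but d(B,C) was discarded:
avg, cutoff = merge_distance(["A", "B"], "C", dists, cutoff)
print(avg, cutoff)  # 0.08 0.08

With the full matrix you could compute the true average; with the sparse matrix the safest thing is to stop reporting OTUs above the largest distance that is still known, which is exactly the “changed cutoff” message you’re seeing.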

Also, I think you guys might be misinterpreting what the highest cutoff means - it doesn’t mean that at 0.10 there’s only one OTU, it just means that clusters weren’t made at that cutoff. If you want the higher cutoffs, you need to increase your cutoff in the dist.seqs command. Make sense?

Pat

Thanks for following up with us. I will try recalculating the distance matrix with a cutoff of 0.45 instead of 0.3.

I’ll be honest that I still don’t quite understand your explanation, but I assume it’s because I’d have to peek under the hood. My hang-up is that I don’t know how you’d be able to throw away distances (assuming they have been calculated based on your dist.seqs command). Since the 0.3 cutoff was applied to both of my sets of sequences, I don’t understand how the distances do not exist at the 0.09 level and thus cannot be clustered at that level.

The reason that I’m interested in really understanding what’s going on is that I like to compare my communities at several levels (3%, 5/6%, 9/10%), so only getting the genus- and species-level clusters is a bit of a problem.

This is really asking too much of the data - there are no distance thresholds that align with taxonomic levels. We describe this in our 2012 AEM clustering paper and others have done so as well. There just isn’t a cutoff for genus, family, order, etc. If you want those taxonomic levels, it is best to use phylotype; if you want sub-genus resolution, use OTUs. So… I’m not sure why anyone would really want anything above 0.03 or 0.05.

So I’ve posted a hand-worked example to the wiki (http://www.mothur.org/wiki/Frequently_asked_questions#Why_does_the_cutoff_change_when_I_cluster_with_average_neighbor.3F). If you look at the pdf that is linked there, you will see the same data processed in two ways. On the left is the full distance matrix with all of the distances - this is what you get when you run cluster.classic(phylip=whatever.dist, cutoff=0.10). On the right is what you get when you run cluster(phylip=whatever.dist, cutoff=0.10). In cluster, any distance larger than 0.10 is removed from the matrix; we do this because it saves on RAM.

As you go down the steps in the pdf, you’ll see a few red XXXXX’s on the right-hand side (e.g. the distance between U68619 and U68602 in the third clustering step was ignored since it was above 0.10). It is x’d out because the distance between U68598 and U68602 is 0.0902, which was clearly less than 0.10. What the algorithm wants to do here is calculate the average of the two distances, but we only have one of them. The solution is to change the cutoff to 0.0902 so the problem goes away. This keeps going until we get to the point where the cutoff is changed to 0.0791. Hopefully this makes a little more sense of why the cutoffs change. The upshot is that when you run…

cluster.classic(phylip=full.square.dist, cutoff=0.10, precision=10000)

and

cluster(phylip=full.square.dist, cutoff=0.10, precision=10000)

you get the same output up until 0.0791.
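If it helps, here is the arithmetic behind that third step written out in Python. The 0.0902 distance comes from the worked example; the discarded U68619-U68602 distance is unknown (only that it was above 0.10), so the 0.12 below is a purely hypothetical stand-in:

d_known = 0.0902      # U68598 vs U68602, kept because it is <= 0.10
d_discarded = 0.12    # U68619 vs U68602, hypothetical value, was thrown away

# cluster.classic keeps the full matrix, so when the OTU containing U68598
# and U68619 is compared to U68602 it can average both distances:
full_average = (d_known + d_discarded) / 2    # 0.1051

# cluster only kept d_known, so the true average is unknowable; instead it
# lowers the reported cutoff to the largest distance it still has:
new_cutoff = d_known                          # 0.0902

print(full_average, new_cutoff)

Everything below the lowered cutoff is unaffected by the missing distances, which is why the two commands agree up to 0.0791.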

Pat