When I run the cluster command with a set cutoff value, mothur automatically changes the cutoff. Here is an example from the log file:
mothur > cluster(column=adultfinal.dist, name=adult.names, cutoff=0.10)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
changed cutoff to 0.028546
Output File Names:
adultfinal.an.sabund
adultfinal.an.rabund
adultfinal.an.list
It took 29964 seconds to cluster.
I set the cutoff to 0.10, but it was changed to 0.028546.
Any suggestions please?
Thanks
This is one of our most common questions. Here’s Pat’s explanation: “This is a product of using the average neighbor algorithm with a sparse distance matrix. When you run cluster, the algorithm looks for pairs of sequences to merge in the rows and columns that are getting merged together. Let’s say you set the cutoff to 0.05. If one cell has a distance of 0.03 and the cell it is getting merged with has a distance above 0.05 then the cutoff is reset to 0.03, because it’s not possible to merge at a higher level and keep all the data. All of the sequences are still there from multiple phyla. Incidentally, although we always see this, it is a bigger problem for people that include sequences that do not fully overlap.” I would suggest increasing your cutoff.
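To make the mechanism in Pat's explanation concrete, here is a minimal Python sketch. It is not mothur's actual code: the function name `merge_cells`, the use of `None` for a distance that was never stored, and the size-weighted average are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def merge_cells(d_a: Optional[float], d_b: Optional[float],
                size_a: int, size_b: int,
                cutoff: float) -> Tuple[Optional[float], float]:
    """Combine the distances from clusters A and B to a third cluster C.

    None stands for a distance above the cutoff that the sparse matrix
    never stored (an illustrative convention, not mothur's file format).
    Returns (distance from the merged cluster to C, possibly lowered cutoff).
    """
    if d_a is not None and d_b is not None:
        # Both cells are known, so the size-weighted average is computable.
        avg = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        return avg, cutoff
    if d_a is None and d_b is None:
        # Neither cell was stored; nothing is lost by leaving it empty.
        return None, cutoff
    # Exactly one cell is missing: the average cannot be computed, so
    # results are only trustworthy up to the distance that is known.
    known = d_a if d_a is not None else d_b
    return None, min(cutoff, known)
```

With the numbers from the explanation, `merge_cells(0.03, None, 1, 1, cutoff=0.05)` returns `(None, 0.03)`: the cutoff drops from 0.05 to 0.03 because the partner cell was above the original cutoff and so was never kept.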
I set a cutoff of 0.05, and it changed the cutoff to 0.0482271. Is this a big problem? You say to increase the cutoff; what would you recommend increasing it to?
Thanks!
It is not a problem. The recommendation to increase the cutoff is only there to keep the cutoff from dropping below the value you wanted to look at. :)
Ok great.
But what does this cutoff actually mean … sorry, relatively new to this.
And does it cause a problem if I’m comparing this dataset with a cutoff of 0.4 to a different dataset that has a cutoff of 0.5?
Thanks!
The cutoff is used to boost speed and save memory. It does this by “ignoring” distances above the cutoff. For example, if you know that you are only interested in OTUs formed at a distance below 0.10, why keep a distance that is greater than 0.10? Because of the average neighbor clustering method, you might want to keep the distances smaller than 0.25, but you don’t need all of the distances. What do you mean by comparing the two datasets?
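As a rough illustration of how a cutoff "ignores" distances, here is a small Python sketch. The sequence names, the distances, and the helper `apply_cutoff` are made up for the example, and keeping distances equal to the cutoff is an assumption rather than a statement about mothur's behavior.

```python
def apply_cutoff(distances, cutoff):
    """Keep only the pairwise distances at or below the cutoff."""
    return {pair: d for pair, d in distances.items() if d <= cutoff}


# Hypothetical pairwise distances between four sequences.
full = {
    ("seqA", "seqB"): 0.02,
    ("seqA", "seqC"): 0.08,
    ("seqB", "seqC"): 0.18,
    ("seqA", "seqD"): 0.31,
}

kept_010 = apply_cutoff(full, 0.10)  # stores 2 of the 4 distances
kept_025 = apply_cutoff(full, 0.25)  # stores 3 of the 4 distances
print(len(kept_010), len(kept_025))
```

The looser cutoff keeps more distances around for the averaging step, which is why a cutoff somewhat above the OTU level you care about leaves less room for the cutoff to be reset, while still avoiding the cost of storing every pair.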