Confused about cluster.split cutoff parameter

Hi Pat,

I’m confused about the cutoff parameter in the cluster.split function. I understand that the taxlevel parameter pre-bins sequences using the taxonomy information, and that clustering within each bin saves memory. What exactly does the cutoff parameter do, and why is it set to 0.15 in the MiSeq SOP? Is this just another measure to reduce memory usage?

Specifically, I am interested in generating OTUs at a 99% similarity cutoff. To do this, do I simply run make.shared and classify.otu with a label of 0.01?

Thanks,
Michelle

The cutoff parameter is used to reduce the size of the distance matrices generated after the split: any distance above the cutoff is discarded, leaving a sparse distance matrix. When the clustering runs on a sparse matrix, each merge needs the distances between the sequences in the rows and columns being merged. Say you set the cutoff to 0.05, and one cell has a distance of 0.03 while the cell it is being merged with had a distance above 0.05 and so was never stored. The cutoff is then reset to 0.03, because it is not possible to merge at a higher level and still keep all the data. No sequences are lost; all of the sequences from the various phyla are still there. For a worked example, see http://www.mothur.org/w/images/7/7c/AverageNeighborCutoffChange.pdf.
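To make the mechanism concrete, here is a toy sketch (not mothur’s actual code, and with made-up sequence names and distances) of why a sparse distance matrix forces the cutoff down: when two sequences are merged, computing the average distance from the merged cluster to every other sequence requires both component distances, and if one of them was above the cutoff and therefore never stored, nothing above the merge level can be trusted.

```python
# Toy model of average-neighbor clustering on a sparse distance matrix.
# Distances above CUTOFF were never written to the matrix, so they are
# simply absent from the dict below.

CUTOFF = 0.05

# Pairwise distances for four hypothetical sequences A-D. The A-D, B-D
# and C-D distances were above 0.05, so the sparse matrix omits them.
dists = {
    ("A", "B"): 0.03,
    ("A", "C"): 0.04,
    ("B", "C"): 0.045,
}

def dist(x, y):
    """Look up a stored distance; None means 'above the cutoff, unknown'."""
    return dists.get((x, y)) or dists.get((y, x))

def merge(c1, c2, others, cutoff):
    """Average-neighbor merge of clusters c1 and c2.

    If the distance from any remaining sequence to one side of the merge
    is missing, the true average is unknowable, so the usable cutoff is
    lowered to the level of the merge itself."""
    merge_level = min(dist(a, b) for a in c1 for b in c2)
    for seq in others:
        halves = [dist(seq, m) for m in c1 + c2]
        if any(h is None for h in halves):
            # Mirrors the "Cutoff was X changed cutoff to Y" message.
            cutoff = min(cutoff, merge_level)
    return c1 + c2, cutoff

# Merge A and B (distance 0.03). D's distances to both are missing,
# so the cutoff drops from 0.05 to the merge level, 0.03.
cluster, cutoff = merge(["A"], ["B"], others=["C", "D"], cutoff=CUTOFF)
print(cutoff)
```

Run it and the reported cutoff drops to 0.03, which is exactly the behavior behind the “changed cutoff” messages discussed below: the data is intact, but results above the new cutoff are simply not reported.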

Hi, so I ran cluster.split following the SOP, except with a cutoff of 0.03, because I have a small computer and I didn’t want to waste processing power calculating OTUs for distances I didn’t need.

Sometimes I got this message: “Cutoff was 0.035 changed cutoff to 0.03” (or to 0.02). Does this mean that the OTUs for some of my phyla are clustered at 98% similarity rather than 97% (in the 0.02 case), and that the 0.035 vs. 0.03 message is simply a result of how the numbers are handled? Most importantly, can I say that my OTUs (when I run make.shared next) are clustered at 97%? I’d like to know whether I’m handling the data incorrectly.

Thanks so much for your help,

Liz

Update: I tried the command again with cutoff = 0.15, just in case 0.03 was introducing errors, and got this message in the logfile associated with the command:

Clustering Weese2.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta.0.dist
Cutoff was 0.155 changed cutoff to 0.03
Cutoff was 0.155 changed cutoff to 0.01
It took 4842 seconds to cluster
Merging the clustered files…
It took 4 seconds to merge.

Output File Names:
Weese2.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list

And there is no data for the 0.02 or 0.03 distances in the list file. What has happened? Is there a random element that changes the clustering results between runs?