read.dist() and/or cluster() error when using cutoff

Hello,

First, let me thank you for the wonderful contribution you’ve made with this software. It is helpful and easy to use. That said, I recently upgraded to v.1.16.1 and I may have identified a bug in either the read.dist() or cluster() commands.

My understanding is that cluster() should produce the same results (aside from variation due to tie-breaking) regardless of whether the user sets a cutoff. However, when I read a distance matrix into mothur with the cutoff option (e.g., read.dist(phylip="in_matrix.phy", cutoff=0.15)), cluster() returns a different number of OTUs for certain cutoffs at or below the specified cutoff (e.g., 0.15) than when I run read.dist() without a cutoff (e.g., read.dist(phylip="in_matrix.phy")). I am using average linkage clustering, and I have run cluster() both with and without a cutoff (e.g., cluster(method=average, cutoff=0.15) and cluster(method=average)); the discrepancy is the same either way.

An example of the difference in output by mothur follows. Here, I have printed the first two columns of the *.an.list files. The first column corresponds to the cutoff threshold and the second column corresponds to the number of OTUs. I have printed the commands I used to generate the output above each set of results.

##With cutoff option##
#Commands

mothur > set.dir(output=…/db/samples/test_16S_reads/otus/BAC/)
Changing output directory to /Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/

mothur > read.dist(phylip=…/db/samples/test_16S_reads/matrix/test_16S_reads_SSU_BAC_FT_pseudo_pruned.phymat, cutoff=0.15)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


It took 0 secs to read

mothur > cluster(method=average, cutoff=0.15)
changed cutoff to 0.0759733

Output File Names:
/Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/test_16S_reads_SSU_BAC_FT_pseudo_pruned.an.sabund
/Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/test_16S_reads_SSU_BAC_FT_pseudo_pruned.an.rabund
/Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/test_16S_reads_SSU_BAC_FT_pseudo_pruned.an.list

#Results

unique 140
0.00 70
0.01 61
0.02 56
0.03 48
0.04 42
0.05 38
0.06 34
0.07 32

Note that the largest clustering threshold reported is 0.07, in line with the changed cutoff (0.0759733) that cluster() printed.

##Without cutoff option##
#Commands

mothur > set.dir(output=…/db/samples/test_16S_reads/otus/BAC/)
Changing output directory to /Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/

mothur > read.dist(phylip=…/db/samples/test_16S_reads/matrix/test_16S_reads_SSU_BAC_FT_pseudo_pruned.phymat)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


It took 0 secs to read

mothur > cluster(method=average, cutoff=0.15)

Output File Names:
/Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/test_16S_reads_SSU_BAC_FT_pseudo_pruned.an.sabund
/Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/test_16S_reads_SSU_BAC_FT_pseudo_pruned.an.rabund
/Users/sharpton/projects/OTU/gittest/db/samples/test_16S_reads/otus/BAC/test_16S_reads_SSU_BAC_FT_pseudo_pruned.an.list

#Results

unique 140
0.00 70
0.01 61
0.02 56
0.03 48
0.04 42
0.05 38
0.06 34
0.07 32
0.08 30
0.09 24
0.11 23
0.12 22
0.13 20
0.14 19

Note that results are given for cutoff values up to 0.14 (0.15 has the same distribution of OTUs as 0.14 in this case).
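
For anyone who wants to reproduce the two-column summaries above, here is a minimal sketch in Python. It assumes each line of a *.an.list file is whitespace-delimited, with the label first and the number of OTUs second, as described earlier in this post. The function name otu_counts and the sample lines are made up for illustration; they are not part of mothur.

```python
def otu_counts(lines):
    """Return (label, number_of_otus) for each line of a list file.

    Assumes whitespace-delimited lines: label, OTU count, then the OTUs.
    Lines whose second field is not an integer are skipped.
    """
    counts = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counts.append((fields[0], int(fields[1])))
    return counts


# Made-up sample lines in the assumed layout (not taken from a real run).
sample = [
    "unique\t3\tseqA\tseqB\tseqC",
    "0.01\t2\tseqA,seqB\tseqC",
]
for label, n_otus in otu_counts(sample):
    print(label, n_otus)
```

Passing an open file handle instead of the sample list should work the same way, since the function only iterates over lines.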

It’s possible that I am misunderstanding the role of the cutoff option in read.dist(), but these observations seem to contradict the description in the manual. If there is any additional information I can provide, please let me know.

Best,
Thomas

Thanks, Thomas. Yes, we know about this, and it isn’t exactly a bug. mothur stores the distance matrix without the distances above your cutoff. This can cause issues for average neighbor because it averages distances and in some cases may need distances that have been excluded. If this happens, mothur lowers the cutoff to the range where it can still see the distances. If you want results that depend on distances that have been removed, you need to increase the cutoff. Sorry we haven’t gotten to putting up a hand-worked example of this…

Pat
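
The explanation above can be illustrated with a toy example. The following Python sketch is not mothur's code and makes no claim about how mothur picks the lowered cutoff. It is a naive average-linkage model under one stated assumption: a distance missing from the table is only known to exceed the cutoff, so substituting the cutoff gives a lower bound on the true average, and merges are trusted only below the smallest such bound. The names average_linkage and truncate and the three-sequence distance table are hypothetical.

```python
from itertools import combinations


def truncate(dist, cutoff):
    """Mimic reading a matrix with a cutoff: drop distances above it."""
    return {pair: d for pair, d in dist.items() if d <= cutoff}


def average_linkage(dist, names, cutoff):
    """Naive average-linkage clustering on a table truncated at `cutoff`.

    dist maps frozenset({x, y}) -> distance; excluded pairs are absent.
    Returns (merges, effective_cutoff). Simplified model, not mothur's code.
    """
    clusters = [frozenset([n]) for n in names]
    merges = []
    effective = cutoff
    while len(clusters) > 1:
        best = None  # (average distance, cluster_a, cluster_b)
        for a, b in combinations(clusters, 2):
            pairs = [dist.get(frozenset((x, y))) for x in a for y in b]
            complete = all(d is not None for d in pairs)
            # Assumption: a missing distance is at least the cutoff, so this
            # value is a lower bound on the true average when incomplete.
            avg = sum(cutoff if d is None else d for d in pairs) / len(pairs)
            if not complete:
                # Results at or above this bound cannot be trusted.
                effective = min(effective, avg)
            elif best is None or avg < best[0]:
                best = (avg, a, b)
        if best is None or best[0] > effective:
            break
        avg, a, b = best
        merges.append((round(avg, 4), sorted(a | b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges, effective


# Hypothetical distances between three sequences a, b and c.
full = {frozenset("ab"): 0.02, frozenset("ac"): 0.08, frozenset("bc"): 0.20}

print(average_linkage(full, "abc", 0.15))
print(average_linkage(truncate(full, 0.15), "abc", 0.15))
```

In this toy table, c joins {a, b} at an average of 0.14, but that average depends on the b-c distance of 0.20. Once the table is truncated at 0.15, that distance is gone, the average cannot be computed, and the sketch lowers its effective cutoff to roughly 0.115 after the first merge. Truncating at a higher value that retains 0.20 (for example 0.25) brings the second merge back, which mirrors the advice to increase the cutoff.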

Hi Pat,

Thanks for clarifying this point; it makes complete sense and explains why average neighbor results appear for the previously excluded cutoff values when the read.dist() cutoff is increased. I’ll amend my code accordingly.

Best,
Thomas