Problem with cluster

When I cluster a dataset with method=average it runs really quickly and the output just gives data for the unique clusters. When I run the same data using methods nearest or furthest it takes a long time to run and I get output for all levels from 0.01 to 0.10 in addition to the unique. I’ve gotten the same results from two different datasets. I am running mothur 1.9 on Windows.
Thanks,
Scott

Scott - can you email us a copy of the distance and names files? mothur.bugs@gmail.com.

Thanks,
Pat

Hi,
I’m having the same problem with the “average” clustering algorithm, although I’m using v.1.11. I originally used v.1.6, and that worked fine. When I run the same preliminary series of commands and then cluster with v.1.11, after about 30 seconds I get a normal-looking “unique” line, then the message “changed cutoff to 0”, and the cluster command terminates. Here are the commands I’ve used:

unique.seqs(fasta=Aodon_926.fasta);
align.seqs(candidate=Aodon_926.unique.fasta, template=/seq/microbiome/Softs/MICROBIOMEUTIL_VERSIONS/current/RESOURCES/rRNA16S.gold.NAST_ALIGNED.fasta, processors=1, align=needleman, search=kmer, ksize=8);
filter.seqs(fasta=Aodon_926.unique.align, vertical=T);
dist.seqs(fasta=Aodon_926.unique.filter.fasta, calc=nogaps, cutoff=0.05);
read.dist(column=Aodon_926.unique.filter.dist, name=Aodon_926.names, cutoff=0.05, precision=100);
cluster(method=average, cutoff=0.03, precision=100);

I can give you the files involved if you’d like them.
Thanks,
Gabe

Gabe - we fixed a bug in the average method that made it look like everything was cool, when, in fact, it wasn’t. Because of the way the clustering algorithm works when using a sparse matrix, it is necessary to adjust the cutoff. I’d suggest setting a higher initial cutoff and seeing what happens from there.
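As a rough sketch of that suggestion, here is the same command series with a higher initial cutoff. The value 0.10 is only an illustrative assumption, not a tested recommendation, and raising it in both dist.seqs and read.dist is likewise an assumption about where the initial cutoff needs to change. File names are taken from the commands posted above.

```
# Illustrative only: cutoff=0.10 is an assumed starting value; adjust as needed.
# The target level in cluster stays at 0.03.
dist.seqs(fasta=Aodon_926.unique.filter.fasta, calc=nogaps, cutoff=0.10)
read.dist(column=Aodon_926.unique.filter.dist, name=Aodon_926.names, cutoff=0.10, precision=100)
cluster(method=average, cutoff=0.03, precision=100)
```

If cluster still reports a changed cutoff below 0.03, the initial cutoff would need to go higher again.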

I have the same problem. If I run the read.dist command with the cutoff option, it doesn’t work:

mothur > read.dist(phylip=1810Ubac.good.filter.phylip.dist, cutoff=0.10)

********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
It took 22 secs to read 

mothur > cluster(method=average)

unique  2       6007    405
changed cutoff to 0

Output File Names: 
1810Ubac.good.filter.phylip.an.sabund
1810Ubac.good.filter.phylip.an.rabund
1810Ubac.good.filter.phylip.an.list

It took 33 seconds to cluster

Without the cutoff option, the average method works fine.

I wonder if your sequences all overlap. When you run filter.seqs are you also doing trump=.? It’s possible that if your sequences don’t fully overlap then you will have problems.

I didn’t use trump=. before.
After filtering with trump=. I get slightly different output:

mothur > read.dist(phylip=1810Ubac.good.filter.phylip.dist, cutoff=0.10)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     |||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
It took 23 secs to read 

mothur > cluster(method=average)
changed cutoff to 0.0454141

I still can’t reach cutoff=0.10.
I also can’t reach cutoff=0.10 on another dataset, even without the cutoff option.

mothur > read.dist(column=mid3.good.filter.unique.dist, name=mid3.good.filter.names)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     |||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
It took 10 secs to read 

mothur > cluster(method=average)
changed cutoff to 0.0424716

In this example I used trump=. too.

Right - try a cutoff of 0.20 or 0.25 if you want to get to 0.10

Cutoffs of 0.20, 0.25, and higher don’t help.

Hello,

I am having the same problem with average neighbor. I have a dataset of about 3500 almost full-length sequences, and clustering with average neighbor always leads to a changed cutoff of around 0.002. Furthest neighbor does not lead to the same error. I have tried this with several versions of mothur and have received the same result. I have also tried this with several different subsets of the data and received the same result. However, with similar datasets in the past, average neighbor worked just fine. I am happy to provide whatever additional information might be useful.

Thanks,
Angus

I found that successful output from the cluster(method=average) command depends on the distance matrix file format.
It works for phylip format and doesn’t work for column format. I got these results for two different datasets of mine.

Can each of you run summary.seqs on the fasta file you use to calculate distances and post the results?

                Start   End     NBases  Ambigs  Polymer
Minimum:        1       386     133     0       3
2.5%-tile:      1       386     164     0       3
25%-tile:       1       386     166     0       4
Median:         1       386     190     0       4
75%-tile:       1       386     191     0       6
97.5%-tile:     1       386     192     0       6
Maximum:        3       386     204     1       9
# of Seqs:      2334

@MetalAlex - when you set cutoff to 0.20 or 0.25 (in dist.seqs), what is the final cutoff value? Feel free to email the fasta or dist matrix file to mothur.bugs@gmail.com

I had only tried changing the cutoff option in the cluster and read.dist commands. Setting cutoff=0.25 in the dist.seqs command helps: my final cutoff value is ~0.11.
Thanks!
I also understand now why cluster(method=average) worked OK with the phylip matrix: when I run dist.seqs with the output=lt and cutoff=0.10 options, the cutoff option is ignored.

Ok - it sounds like the program is working the way it should then. If you set output to lt in dist.seqs then the cutoff is ignored because it doesn’t cost anything extra to store the real distance over some other placeholder in the distance matrix.
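A minimal sketch of the lower-triangle route described here. The fasta name 1810Ubac.good.filter.fasta is an assumption (only the resulting distance file is named in the thread), and cutoff=0.20 is an illustrative value rather than a recommendation.

```
# Assumed input name; output=lt writes the full lower-triangle matrix,
# so no cutoff is given to dist.seqs (it would be ignored anyway).
dist.seqs(fasta=1810Ubac.good.filter.fasta, output=lt)
# The cutoff is applied when the phylip matrix is read; 0.20 is illustrative.
read.dist(phylip=1810Ubac.good.filter.phylip.dist, cutoff=0.20)
cluster(method=average)
```

With method=average, cluster may still report a changed cutoff lower than the one given to read.dist.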

                Start   End     NBases  Ambigs  Polymer
Minimum:        1       632     436     0       4
2.5%-tile:      1       1208    859     0       5
25%-tile:       1       1227    1180    0       5
Median:         1       1227    1210    0       5
75%-tile:       1       1227    1212    0       6
97.5%-tile:     15      1227    1221    0       7
Maximum:        633     1227    1224    2       9
# of Seqs:      3591

Hi Pat,
Here is my output from summary.seqs.
So I think the problem is semi-fixed for average neighbor with this particular dataset. If I use the output=lt option when making the distance matrix and then set the cutoff to 0.2 when reading the phylip formatted matrix, then average neighbor automatically sets the cutoff to 0.039. I am still a little fuzzy on why this is occurring, but I am happy that I do not need to analyze distances greater than 0.03 (which I am currently not planning on doing with this dataset). When I look at the distance matrix, distances do exist between 0.039 and 0.2.

Next, I tried the following to get a better feel for what is going on.
I ran trim.seqs and removed all sequences below a minimum length of 1000 bp. Now summary.seqs gives this:

                Start   End     NBases  Ambigs  Polymer
Minimum:        1       1000    1000    0       4
2.5%-tile:      1       1077    1077    0       5
25%-tile:       1       1185    1185    0       5
Median:         1       1210    1210    0       5
75%-tile:       1       1212    1212    0       6
97.5%-tile:     1       1222    1222    0       7
Maximum:        1       1224    1224    2       9
# of Seqs:      3375

Now if I make a lower triangle distance matrix, read it in with a cutoff of 0.2, and then cluster with average neighbor, the automatic cutoff gets changed to 0.0849. When I look at the distance matrix, distances do exist between 0.0849 and 0.2.
The sequences that were removed by trim.seqs were of high quality and passed all previous quality control steps.

Thanks,
Angus