Clustering a 10GB distance matrix


I just started using mothur a month ago, and the wiki and forum have been very useful resources as I learn my way around, so thanks all for maintaining those.

As background, I’m running mothur on an 8-processor Mac (OS 10.6) with 18 GB of RAM and 60 GB of swap on an SSD. I’m following the Costello stool analysis example to analyze environmental bacterial 454 sequences, and I have a distance matrix of 45,000 sequences calculated with a 0.25 cutoff. Since the file is 9.7 GB and hcluster is read/write intensive, I moved the matrix file onto the SSD and started hcluster; after 4 days it had not produced any output. Since I had RAM and swap available, I then tried the cluster command (average neighbor, 0.03 cutoff). Reading the matrix into memory and clustering took up 16 GB of RAM plus 26 GB of swap. I started this on Wednesday and terminated it today, after 5 days, because it had not progressed since finishing the ‘unique’ step on day 1. The commands for both procedures are below. My questions are:

1). We decided to use the SSD to minimize read/write time, so we allocated 60 GB of swap on it. It is working as swap, but it didn’t seem to help the 18 GB of RAM with clustering. Should this swap have been able to ‘trick’ mothur into thinking I had more RAM?
2). Although the file is almost 10 GB, I have enough RAM to read it into memory. Do you know why neither cluster nor hcluster worked? Or should I have given either or both more time?
3). A 0.10 distance matrix gave a smaller file, but clustering it changed my cutoff from 0.03 to 0.0212287. Should I try a 0.15 or 0.20 distance matrix instead, so that the final clustering cutoff stays at 0.03?

Thanks and cheers,

mothur > dist.seqs(fasta=B.1233.fasta, cutoff=0.10, processors=8)

Output File Name:

It took 2042 to calculate the distances for 44529 sequences.

I changed the filename to B.1233.10.dist.

mothur > read.dist(column=B.1233.10.dist, name=B.1233.names)
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||

It took 607 secs to read

mothur > cluster(method=average, cutoff=0.03)
changed cutoff to 0.0212287

Output File Names:

It took 71663 seconds to cluster
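
The "changed cutoff to 0.0212287" line in this run fits the usual explanation for average-neighbor clustering on a truncated matrix: once clusters merge, the average distance to another cluster needs every pairwise distance, and some of those were dropped by the dist.seqs cutoff. The Python sketch below is a toy illustration of that idea only, not mothur's code. The three sequences, their distances, and the helper names are all hypothetical.

```python
from itertools import product


def average_linkage(cluster_a, cluster_b, dists):
    """Mean pairwise distance between two clusters.

    Returns None if any required pair is absent from the sparse matrix,
    which is the situation a truncated distance file can create.
    """
    total = 0.0
    count = 0
    for a, b in product(cluster_a, cluster_b):
        d = dists.get(frozenset((a, b)))
        if d is None:
            return None  # a needed distance was cut off; the average is unknown
        total += d
        count += 1
    return total / count


def truncate(dists, cutoff):
    """Keep only the distances at or below the matrix cutoff."""
    return {pair: d for pair, d in dists.items() if d <= cutoff}


# Hypothetical toy distances among three sequences.
full = {
    frozenset(("A", "B")): 0.02,
    frozenset(("A", "C")): 0.05,
    frozenset(("B", "C")): 0.12,
}

# A and B have merged; how far is the merged cluster from C?
at_010 = average_linkage(["A", "B"], ["C"], truncate(full, 0.10))
at_025 = average_linkage(["A", "B"], ["C"], truncate(full, 0.25))

print("matrix cutoff 0.10:", at_010)
print("matrix cutoff 0.25:", at_025)
```

With the 0.10 matrix the B to C distance is missing, so the average cannot be computed; with the 0.25 matrix it can. This is why a larger matrix cutoff tends to leave the requested clustering cutoff intact, at the price of a bigger file.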

I then tried a 0.25 distance matrix.

mothur > dist.seqs(fasta=B.1233.fasta, cutoff=0.25, processors=8)

Output File Name:

It took 2589 to calculate the distances for 44529 sequences.

I changed the filename to B.1233.25.dist.

mothur > hcluster(column=B.1233.25.dist, name=B.1233.names, method=average, cutoff=0.03)

With no progress after 4 days, I terminated the command and changed tactics.

mothur > read.dist(column=B.1233.25.dist, name=B.1233.names)
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||

It took 1311 secs to read

mothur > cluster(method=average, cutoff=0.03)

quitting command…

With no progress after 5 days, I terminated the command.
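
For a sense of scale, here is a rough Python sketch of how many pairwise distances 44,529 sequences can produce, and what fraction of them a 9.7 GB column-format file might hold. The bytes-per-line figure is an assumption made for illustration (it depends on sequence name lengths and number formatting), and 9.7 GB is treated as 9.7e9 bytes, so the fraction is a ballpark at best.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


n_seqs = 44529                      # from the dist.seqs log above
all_pairs = pair_count(n_seqs)      # every possible pairwise distance

file_bytes = 9.7e9                  # reported size of the 0.25 matrix (assumed decimal GB)
assumed_bytes_per_line = 30         # ASSUMPTION: "nameA nameB 0.1234\n" is about 30 bytes

estimated_lines = file_bytes / assumed_bytes_per_line
fraction_kept = estimated_lines / all_pairs

print(f"possible pairs:        {all_pairs:,}")
print(f"estimated stored rows: {estimated_lines:,.0f}")
print(f"rough fraction kept:   {fraction_kept:.2f}")
```

The exact part is the pair count, which is close to a billion. The rest only shows that even a truncated matrix of this size holds hundreds of millions of distances under the stated assumption.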


So I think the problem is the number of sequences going through the pipeline - I’d be surprised if you have 45k sequences that are unique. Are you doing all of the trimming steps outlined in the Costello example? Are you running pre.cluster? How many sequences are in your groups file? It should definitely be possible - get me the answers to these questions and we can get you going…


I ran pre.cluster, and it removed 11,500 sequences, enough to get the size down so that the resulting distance matrix clustered in a day. I hadn’t run it initially because I had already denoised my sequences before processing them with the commands in the Costello example, so when I got to the pre.cluster step and read that it was for removing noise, I skipped it.
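
The effect of removing sequences is larger than the raw count suggests, because the number of pairwise distances grows with the square of the number of sequences. A short Python sketch of that arithmetic follows. It assumes the 11,500 removed sequences (a rounded figure) came off the 44,529 reported in the dist.seqs log, and it counts all possible pairs rather than only those under the matrix cutoff.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


before_seqs = 44529                 # from the dist.seqs log
removed = 11500                     # rounded figure reported after pre.cluster
after_seqs = before_seqs - removed  # ASSUMPTION: both counts refer to the same set

before_pairs = pair_count(before_seqs)
after_pairs = pair_count(after_seqs)
ratio = after_pairs / before_pairs

print(f"sequences: {before_seqs:,} -> {after_seqs:,}")
print(f"pairs:     {before_pairs:,} -> {after_pairs:,}")
print(f"pairs remaining: {ratio:.0%}")
```

Under these assumptions, dropping roughly a quarter of the sequences removes close to half of the possible pairs.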

So the question remains whether swap memory on an SSD would have behaved enough like regular RAM to help in clustering a distance matrix. Any thoughts? Or is that the point where hcluster would be more useful?