Computer Issues with hcluster

bpyoumans · May 20, 2011, 6:14pm

Hello mothur team,

We were following the Costello Analysis for one of our datasets, and after the pre.cluster step we had about 120,000 sequences. Dist.seqs created a distance matrix that has 17 million lines and is 160 GB. We tried to run the cluster command, but it depleted the memory so much that we were afraid the computer would shut off (it's happened before), so we aborted the command. Since hcluster should have a smaller memory footprint, we decided to try that. However, the next morning the computer had an error message that said "not enough disc space". The mothur log said, "It took 70844 seconds to sort." We're running this on a MacPro with 2 dual core processors and 9 GB of RAM. We're using mothur 19.

We're wondering what specifications are needed to run a dataset of this magnitude or larger (possibly 4-5 times larger). Any suggestions would be appreciated.

Bonnie and Diane

pschloss · May 21, 2011, 7:34pm

Are you sure that you are using the quality trimming and alignment filtering? One option is to try the cluster.split command. Classify all of your sequences that came out of pre.cluster and then run cluster.split using a taxonomic level of 2 or 3 and see how that works.
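
To illustrate why splitting by taxonomy can help, here is a rough Python sketch (not mothur's actual implementation). It assumes taxonomy strings in the usual semicolon-separated form with confidence scores in parentheses; the sequence names, lineages, and helper functions are all made up for the example. It counts how many pairwise distances would be needed with and without splitting at a given taxonomic level.

```python
from collections import defaultdict


def pair_count(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def split_by_taxlevel(taxonomy, taxlevel):
    """Group sequence names by the first `taxlevel` ranks of their lineage.

    `taxonomy` maps a sequence name to a lineage string such as
    "Bacteria(100);Firmicutes(100);Clostridia(98);" (assumed format).
    """
    groups = defaultdict(list)
    for name, lineage in taxonomy.items():
        # Drop the "(confidence)" suffix from each rank and skip empty fields.
        ranks = [rank.split("(")[0] for rank in lineage.split(";") if rank]
        key = ";".join(ranks[:taxlevel])
        groups[key].append(name)
    return dict(groups)


# Hypothetical toy data, for illustration only.
toy_taxonomy = {
    "seq1": "Bacteria(100);Firmicutes(100);Clostridia(98);",
    "seq2": "Bacteria(100);Firmicutes(100);Bacilli(95);",
    "seq3": "Bacteria(100);Bacteroidetes(100);Bacteroidia(99);",
    "seq4": "Bacteria(100);Bacteroidetes(100);Bacteroidia(97);",
    "seq5": "Bacteria(100);Proteobacteria(100);Gammaproteobacteria(90);",
}

groups = split_by_taxlevel(toy_taxonomy, taxlevel=2)
total_pairs = pair_count(len(toy_taxonomy))
split_pairs = sum(pair_count(len(members)) for members in groups.values())

print("pairs without splitting:", total_pairs)
print("pairs after splitting at level 2:", split_pairs)
print("pairs for 120,000 unsplit sequences:", pair_count(120_000))
```

The point of the sketch is only that distances are compared within each group rather than across the whole dataset, so the largest single matrix is bounded by the largest group. How much this helps on real data depends on how evenly the sequences spread across taxa.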

bpyoumans · May 24, 2011, 7:37pm

Thanks, that helped. We had already done quality trimming, alignment, chimera checking, etc., using the same method as the Costello analysis. We’ll move on from here and see what happens.
