Hello,
I just started using mothur a month ago, and the wiki and forum have been very useful resources as I learn my way around, so thanks all for maintaining those.
As background, I’m running mothur on an 8-processor Mac with OS X 10.6, 18 GB of RAM, and 60 GB of swap space on an SSD. I’m following the Costello stool analysis example to analyze environmental bacterial 454 sequences, and I have a distance matrix of 45,000 sequences built with a 0.25 cutoff. Since the file is 9.7 GB and hcluster is read/write intensive, I moved the matrix onto the SSD and started hcluster; however, after 4 days it hadn’t produced any output. Then, since I had RAM and swap available and no success with hcluster, I tried the cluster command instead (average neighbor, 0.03 cutoff). Reading the matrix into memory and clustering took 16 GB of RAM plus 26 GB of swap; I started this on Wednesday and terminated it today (a run time of 5 days) because it made no further progress after finishing the ‘unique’ step on day 1. The commands for both runs are below. My questions are:
1). We decided to use the SSD to minimize read/write time, so we allocated 60 GB of swap space on it. The swap is being used, but it didn’t seem to help the 18 GB of RAM with clustering. Should this swap space have been able to ‘trick’ mothur into thinking I had more RAM?
2). Though the file is almost 10 GB, I have enough RAM to read it into memory. Do you know why neither cluster nor hcluster finished? Or should I have given either/both more time?
3). Using a 0.10 cutoff for dist.seqs gave a smaller matrix, but cluster then lowered my final cutoff below 0.03. Should I try building a 0.15 or 0.20 distance matrix so that I can keep the final clustering cutoff of 0.03?
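In case it helps, here is what I’d plan to run for question 3 if the answer is yes — the 0.15 cutoff is just my guess at a safe margin above 0.03, and I’d rename the output as before:

mothur > dist.seqs(fasta=B.1233.fasta, cutoff=0.15, processors=8)
mothur > read.dist(column=B.1233.dist, name=B.1233.names)
mothur > cluster(method=average, cutoff=0.03)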
Thanks and cheers,
Sharon
mothur > dist.seqs(fasta=B.1233.fasta, cutoff=0.10, processors=8)
…
Output File Name:
B.1233.dist
It took 2042 secs to calculate the distances for 44529 sequences.
I changed the filename to B.1233.10.dist.
mothur > read.dist(column=B.1233.10.dist, name=B.1233.names)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
It took 607 secs to read
mothur > cluster(method=average, cutoff=0.03)
changed cutoff to 0.0212287
Output File Names:
B.1233.10.an.sabund
B.1233.10.an.rabund
B.1233.10.an.list
It took 71663 seconds to cluster
I tried with a 0.25 distance matrix.
mothur > dist.seqs(fasta=B.1233.fasta, cutoff=0.25, processors=8)
…
Output File Name:
B.1233.dist
It took 2589 secs to calculate the distances for 44529 sequences.
I changed the filename to B.1233.25.dist.
mothur > hcluster(column=B.1233.25.dist, name=B.1233.names, method=average, cutoff=0.03)
With no output after 4 days, I terminated the command and changed tactics.
mothur > read.dist(column=B.1233.25.dist, name=B.1233.names)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
It took 1311 secs to read
mothur > cluster(method=average, cutoff=0.03)
quitting command…
With no progress after 5 days, I terminated the command.