Estimating memory needed for cluster.split

Hi!
Is there any way to estimate the amount of memory (and time) needed for cluster.split? I need to reserve the resources beforehand, and I have already run out of memory once, which terminated the process but used up my quota anyway. I have 339 samples altogether, with 10,870,005 sequences in total; the largest sample contains 901,207 sequences.

Last time, when I ran out of memory (exceeded 30,000 MB), I used the following command:
cluster.split(fasta=final.fasta, name=final.names, taxonomy=final.taxonomy, splitmethod=classify, method=nearest, taxlevel=4, cutoff=0.2, processors=4)

This time I am planning to run the following command instead:
cluster.split(fasta=final.fasta, name=final.names, taxonomy=final.taxonomy, splitmethod=classify, method=opti, taxlevel=4, cutoff=0.03, processors=4)

I believe the maximum total memory available to me is 256 GB and the maximum number of cores is 16. However, the more I reserve, the more of my quota I use and the longer I have to wait in the queue…
I would be very glad if someone has even a rough estimate!

Are you preclustering? What is the largest dist file you got out last time? That is how much RAM you will need for that step, and you will likely need to drop to 1 processor for the clustering step. You can also split clustering into two steps: dist.seqs, then cluster (see the sketch below).
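A minimal sketch of that two-step route, reusing the file names from the first post (the cutoff here is just an example; use whatever your analysis calls for):

dist.seqs(fasta=final.fasta, cutoff=0.03, processors=4)
cluster(column=final.dist, name=final.names, cutoff=0.03)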

#make OTUs for each Order individually; for very large datasets (hundreds of samples) you may need to split at a finer taxlevel, 5 or even 6. If you use 6, you will likely only be able to get 3% OTUs, because the within-group differences aren't always 5%
#cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=16, cluster=f)
#if your data is very diverse you may end up with .dist files that are too large for the number of processors (if you are using 4 processors, your dist file needs to fit in RAM 4 times; on my system, I have to drop the processors in this step if my largest .dist is over 64 GB)
#cluster.split(file=current, processors=4)
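To put rough numbers on that rule (assuming RAM use scales with the on-disk size of the .dist file): with 256 GB of RAM, 16 processors means the largest .dist must fit 16 times, i.e. stay under 256/16 = 16 GB; at 4 processors the ceiling is 256/4 = 64 GB; and a single processor can handle a file up to the full 256 GB.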

Thank you so much for the response!
Yes, I have preclustered, removed chimeras, and removed sequences from the wrong lineages (non-bacteria). I have used unique.seqs a couple of times, and at the moment there are 612,984 unique sequences in total.
The largest dist file I got from the run that exceeded the memory reservation is 287,624 MB, so I believe it is not worth trying again even with only one core. I searched a bit and learned that there should be an even larger amount of memory available, but I need to ask for details on how to use it. Meanwhile, I can first do the splitting. Should 4000 MB per core with 16 cores be enough to run this step?:
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=16, cluster=f)
Any idea how long it could take? The longer the time I reserve, the longer I have to wait in the queue…
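As a rough check of the one-core question (assuming the matrix needs about as much RAM as its on-disk size): 287,624 MB is roughly 281 GB, which already exceeds the 256 GB total, so even a single process could not hold that matrix; the .dist file itself has to shrink (tighter cutoff, finer taxlevel, or fewer unique sequences) before clustering can succeed.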

First, to get that many unique reads, I suspect you have sequence data with a high error rate. See this discussion: http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/

Next, make sure you are using mothur v1.39.5. Then use cutoff=0.03 and, if necessary, taxlevel=5.
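If taxlevel=4 still yields an oversized .dist, the taxlevel=5 variant would look like this (a sketch reusing the file names from the earlier commands):

cluster.split(fasta=final.fasta, name=final.names, taxonomy=final.taxonomy, splitmethod=classify, method=opti, taxlevel=5, cutoff=0.03, processors=4)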

Pat

Oops, I need to change my cutoff. You probably won't be able to use more than one processor with 4 GB of RAM (right? 4000 MB of RAM is 4 GB). I have 256 GB of RAM and use 4 cores. For the clustering step, you must be able to load the entire largest dist file into RAM.