RAM needed for cluster.split

sje062 · December 1, 2020, 11:56am

Hi mothur forum,
mothur is really helpful, but how can I tell when my PC can’t handle cluster.split?
I’m running it with column, count and taxonomy files and splitmethod=classify. The distance matrix is about 70 GB, and the largest temp file is currently about half that size. I wonder whether my PC (a standard 8 GB of RAM) can handle this, or whether I should give cluster.split a fasta file instead, or perhaps cluster using phylotypes.
Sigmund

pschloss · December 3, 2020, 5:25pm

It would likely run much faster if you gave it a fasta file rather than a distance matrix. That being said, 8 GB isn’t much, and Windows always seems to run slower than a Mac or Linux computer. Can you run it on a high-performance computing cluster at your institution or on Amazon?

Pat

sje062 · December 7, 2020, 3:08pm

Thanks, yes, I could try a high-performance computing cluster. But before doing that, is there a rule of thumb for the RAM needed? For example, how much RAM would a PC need to process a 10, 50 or 100 GB distance matrix?
Sigmund

pschloss · December 10, 2020, 1:50pm

Sorry, but it’s hard to say. I would first try running it with a fasta file and see what happens. I just noticed you said you had a 70 GB distance matrix, which is quite large, and I worry that it might not be possible to run it through cluster. See this blog post for a partial explanation and some thoughts: https://mothur.org/blog/2014/Why-such-a-large-distance-matrix/
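For a sense of why these matrices get so large, here is a rough back-of-the-envelope sketch in Python. It only uses the fact that n unique sequences give n(n-1)/2 possible pairs. The `bytes_per_row` value (bytes per line of a column-format file) and `kept_fraction` (the share of pairs that survive a distance cutoff) are assumptions made for illustration, not mothur settings, so the output is an order-of-magnitude guide at best and says nothing definite about RAM.

```python
def pairwise_distance_count(n_unique: int) -> int:
    """Number of distinct pairs among n_unique sequences: n(n-1)/2."""
    return n_unique * (n_unique - 1) // 2


def column_matrix_size_gb(n_unique: int,
                          bytes_per_row: int = 40,
                          kept_fraction: float = 1.0) -> float:
    """Rough size in GB (10**9 bytes) of a column-format distance file.

    bytes_per_row and kept_fraction are illustrative assumptions:
    real rows vary with sequence-name length, and a cutoff usually
    means only some of the pairs are written out.
    """
    rows = pairwise_distance_count(n_unique) * kept_fraction
    return rows * bytes_per_row / 1e9


if __name__ == "__main__":
    for n in (10_000, 100_000, 500_000):
        pairs = pairwise_distance_count(n)
        size = column_matrix_size_gb(n)
        print(f"{n:>8} unique seqs -> {pairs:>15,} pairs, "
              f"~{size:,.1f} GB if every pair is written")
```

Because the pair count grows with the square of the number of unique sequences, halving the number of unique sequences cuts the worst-case matrix to roughly a quarter of its size.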

Gschwob · November 17, 2021, 1:29pm

Hi there,
I’m taking advantage of this topic to ask my question about the RAM requirements of the cluster.split function.
If the largest temp.dist matrix is 70 GB, it means that cluster will use 70 GB of RAM to load it, cluster it, and then move on to the next temp.dist file. Am I correct? And if I understand your answer correctly, Pat, there is no way to know how much “extra” RAM is needed for the clustering itself once the matrix is fully loaded?
I’m asking because I ran cluster.split on my data and the largest distance matrix I got is 230 GB. Since my cluster has 22 cores and 245 GB RAM + 64 GB swap, I presumed that it would work, yet the process kept filling my RAM until it finally got killed.

Thank you for your help,

Guillaume
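A minimal Python sketch for putting numbers on this kind of comparison: it finds the largest temp file in a working directory and reports the ratio of RAM to file size. The `*temp*` glob pattern and both function names are hypothetical, chosen for illustration. Nothing in this thread establishes what ratio is sufficient; for the figures reported above (245 GB RAM, 230 GB matrix) the ratio is about 1.07, and that run was killed.

```python
from pathlib import Path
from typing import Tuple


def largest_file(directory: str, pattern: str = "*temp*") -> Tuple[str, int]:
    """Return (name, size in bytes) of the largest file matching pattern.

    The default pattern is an assumption; adjust it to whatever your
    temp distance files are actually called.
    """
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    biggest = max(files, key=lambda p: p.stat().st_size)
    return biggest.name, biggest.stat().st_size


def ram_to_file_ratio(file_bytes: int, ram_bytes: int) -> float:
    """RAM divided by file size; a descriptive number, not a pass/fail test."""
    return ram_bytes / file_bytes


if __name__ == "__main__":
    GB = 10**9
    # Figures reported in this thread: 230 GB largest matrix, 245 GB RAM.
    print(f"RAM / file size = {ram_to_file_ratio(230 * GB, 245 * GB):.2f}")
```

Treating file size as a proxy for memory use is itself an assumption, so a ratio above 1 should not be read as a guarantee that clustering will complete.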

sje062 · November 17, 2021, 2:42pm

Hi,
sorry, I do not know whether the RAM required equals the file size. Looking briefly over my notes, I see (and now remember) that my cluster.split ran slowly over four days to make the temp files. Then, a few days into the clustering, cluster.split stopped. I solved it by using a university server with much more RAM. Can you do the same? Also, I wonder why your matrix is so big. Can you reduce its size somehow (for example, by removing some sequences or making sure only unique sequences are included)?
Sigmund

pschloss · November 30, 2021, 5:51pm

Hi,

I’m pretty confident that you won’t be able to do anything with a 230 GB distance matrix. I’d encourage you to read the blog post I linked to above. You’ll probably be limited to using the phylotype-based approach.

Pat
