Hi mothur forum,
mothur is really helpful, but how can I tell when a PC can’t handle cluster.split?
I’m running it with column, count, taxonomy and splitmethod=classify. The distance matrix is about 70 GB. The largest temp file is about half that size now. I wonder whether my PC (standard 8 GB RAM) can handle this, or whether I have to use a fasta file with cluster.split instead, or perhaps cluster using phylotypes.
It would likely run much faster if you gave it a fasta file rather than a distance matrix. That being said, 8 GB isn’t much, and Windows always seems to run slower than a Mac or Linux computer. Can you run it on a high performance computer cluster at your institution or on Amazon?
Thanks, yes, I could try a high performance computer cluster. But before doing that, is there a rule of thumb for how much RAM is needed? For example, how much RAM would a PC need to process a 10, 50 or 100 GB distance matrix?
Sorry, but it’s hard to say. I would first try running it with a fasta file and see what happens. I just noticed you said you had a 70 GB distance matrix, which is quite large, and I worry that it might not be possible to run it through cluster. See this blog post for a partial explanation and some thoughts: https://mothur.org/blog/2014/Why-such-a-large-distance-matrix/
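As a rough illustration of why these matrices get so large so quickly: the number of pairwise distances grows with the square of the number of unique sequences. The Python sketch below is only a back-of-the-envelope estimate, not something mothur provides. The function names, the 40 bytes per line (two sequence names plus a distance in column format) and the fraction of distances kept under the cutoff are all assumptions made up for illustration, and the result says nothing definite about the RAM that clustering will need.

```python
def pairwise_count(n_unique: int) -> int:
    """Number of distinct pairs among n_unique sequences: n * (n - 1) / 2."""
    return n_unique * (n_unique - 1) // 2


def estimated_column_file_gb(n_unique: int,
                             kept_fraction: float = 1.0,
                             bytes_per_line: int = 40) -> float:
    """Very rough size (in GB, 1e9 bytes) of a column-format distance file.

    kept_fraction  -- ASSUMED share of pairs at or below the distance cutoff
    bytes_per_line -- ASSUMED average line length (two names plus a distance)
    """
    return pairwise_count(n_unique) * kept_fraction * bytes_per_line / 1e9


if __name__ == "__main__":
    # Hypothetical numbers of unique sequences, chosen only to show the trend.
    for n in (10_000, 100_000, 200_000):
        size_gb = estimated_column_file_gb(n, kept_fraction=0.1)
        print(f"{n:>8} uniques: {pairwise_count(n):>14} pairs, ~{size_gb:.1f} GB")
```

The point of the sketch is the scaling: doubling the number of unique sequences roughly quadruples the number of distances, so reducing the number of uniques has a much larger effect than the raw sequence count suggests.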
I’m taking advantage of this topic to ask my own question about the RAM requirements of the cluster.split function.
If the largest temp.dist matrix is 70 GB, does that mean that cluster will use 70 GB of RAM to load it, cluster it, and then move on to the next temp.dist file? And if I understand your answer correctly, Pat, there is no way to know how much “extra” RAM is needed to do the clustering once the matrix is fully loaded?
I’m asking because I ran cluster.split on my data and the largest distance matrix I get is 230 GB. Since my cluster has 22 cores, 245 GB RAM + 64 GB swap, I presumed that it should have worked, yet the process kept filling my RAM until it finally got killed.
Thank you for your help,
Sorry, I do not know whether the RAM required equals the file size. Looking briefly over my notes, I now remember that my cluster.split ran slowly for four days to make the temp files. Then, a few days into the clustering, cluster.split stopped. I solved it by using a university server with much more RAM. Can you do the same? Also, I wonder why your matrix is so big. Can you reduce its size somehow (for example, by removing some sequences, or by making sure only unique sequences are included)?
I’m pretty confident that you won’t be able to do anything with a 230 GB distance matrix. I’d encourage you to read the blog post I linked to above. You’ll probably be limited to using the phylotype-based approach.