Cluster.split runtime problem

Hi, I am a new user of mothur and am following the MiSeq SOP. My input to cluster.split is ~1 million reads. It crashed after running for 3 days on 30 processors. Is there an alternative way to do this? Thanks!

Did you run unique.seqs and pre.cluster? I have a dataset of hundreds of soil samples that still had ~1M seqs after pre.clustering, but I was able to generate OTUs using 1 processor on a machine with 128GB RAM in 9 days.

Yes, I used unique.seqs before pre.cluster. I have 24 samples. I guess running hierarchical clustering on a 1M by 1M matrix takes a very long time. Would it make sense to include the uclust method as an option for large datasets?
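To put the "1M by 1M matrix" in perspective, here is a rough sketch that counts the distinct pairwise distances among n unique sequences. It is only an upper bound: it assumes every pair is kept, whereas a distance calculation run with a cutoff usually stores far fewer. The function name `pairwise_distance_count` is made up for illustration.

```python
def pairwise_distance_count(n_seqs: int) -> int:
    """Distinct unordered pairs among n_seqs sequences (one triangle of the matrix)."""
    if n_seqs < 0:
        raise ValueError("n_seqs must be non-negative")
    return n_seqs * (n_seqs - 1) // 2


if __name__ == "__main__":
    n = 1_000_000  # approximate unique-sequence count from the post above
    # Upper bound only: assumes no distance cutoff removes any pairs.
    print(f"{n:,} sequences -> {pairwise_distance_count(n):,} possible pairs")
    # prints: 1,000,000 sequences -> 499,999,500,000 possible pairs
```

The count grows with the square of the number of uniques, which is why reducing uniques before clustering matters so much.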

cluster.split will work if you have enough RAM. Like I said, ONE processor that could access 128GB RAM worked for a similar-sized dataset. Each additional processor multiplies the RAM needed. My big cluster job maxed out around 90GB RAM; had I tried to use 2 processors, I would have needed 180GB RAM and the computer would have hung. Before my lab got the server with serious RAM, I tried using an SSD as swap to increase the virtual memory, but that didn’t work; it still hung. So rerun cluster.split on a single processor with as much RAM as you can get your hands on.
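The arithmetic in this reply can be sketched as a quick budget check. It assumes, as the reply describes, that each processor needs roughly the same peak memory as a single-processor run; that is a simplification taken from this post, not a documented guarantee. The function names and the figures passed in are illustrative.

```python
def ram_needed_gb(processors: int, peak_ram_per_proc_gb: float) -> float:
    """Total RAM needed if every processor hits the same peak (assumed model)."""
    return processors * peak_ram_per_proc_gb


def max_processors(total_ram_gb: float, peak_ram_per_proc_gb: float) -> int:
    """Largest processor count whose combined peak fits in total_ram_gb."""
    if peak_ram_per_proc_gb <= 0:
        raise ValueError("peak_ram_per_proc_gb must be positive")
    return int(total_ram_gb // peak_ram_per_proc_gb)


if __name__ == "__main__":
    # Figures quoted in the post: ~90GB peak on a 128GB machine.
    print(ram_needed_gb(2, 90))     # 180 (GB needed for 2 processors)
    print(max_processors(128, 90))  # 1 (only one processor fits)
```

A result of 0 from `max_processors` would mean that, under this model, even a single-processor run is expected to exceed the available RAM.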

Just curious, what are your samples that you’re getting a million pre.clustered uniques from just 24 samples? I had hundreds of soils.

It’s unlikely that a 1Mx1M matrix will ever make it through. You can try cluster.split, but I still have my doubts. You probably should know about this…

Hi everyone,

I’m not sure if I’m encountering the same problem. The cluster.split command is running, but it has been hanging on the clustering step for one fasta file for two days. Since it hasn’t crashed, I don’t want to kill it, but I am unsure whether this is a problem. I ran the command using 8 processors and my machine has 64GB of RAM, so maybe I should try with one processor? It’s hanging on the clustering of “cod index.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta.0.dist”, and this file is 274GB. Should I just kill this process, or does it just take a lot of time?

I did have a large dataset, but after quality control, chimera removal, etc. I was working with 272,514 sequences. Obviously this is still a really large number of uniques. My data is from 68 samples from the cod gut under different feeding regimes.
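As a rough sanity check on the numbers in this post, the sketch below compares a distance file's size with the machine's RAM. On-disk size is not the same as the memory the clustering step will need, so treat the ratio only as a hint that the job may be memory-bound. The helper names and the decimal-GB convention (1 GB = 10^9 bytes) are assumptions.

```python
import os


def file_size_gb(path: str) -> float:
    """Size of a file on disk in decimal gigabytes (1 GB = 1e9 bytes)."""
    return os.path.getsize(path) / 1e9


def dist_to_ram_ratio(dist_file_gb: float, ram_gb: float) -> float:
    """How many times larger the distance file is than the available RAM."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return dist_file_gb / ram_gb


if __name__ == "__main__":
    # Figures quoted in the post: a 274GB .dist file on a 64GB machine.
    ratio = dist_to_ram_ratio(274, 64)
    print(f"distance file is {ratio:.1f}x the size of RAM")
```

`file_size_gb` can supply the first argument when the file is at hand, for example `dist_to_ram_ratio(file_size_gb(path), 64)` with `path` pointing at the `.dist` file.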

Thanks for your help,


I’d let it sit. I’ve got datasets that take a week or more to run.