I’m trying to cluster my data, but I end up with a lot of very large files, the program stops, and I get thrown off the computing cluster for using too much memory.
What I’m trying to cluster is the following:
I have tried cluster.split according to the MiSeq SOP, with the same settings: splitmethod=classify, taxlevel=4, cutoff=0.03 (mothur/v.1.41.1).
Clustering with phylotypes works, but as I understand it, this is not the optimal solution. What else can I do?
I’m running my analysis on a computer cluster at the university where I can request working time. I can set --time=, --nodes=, --tasks= and --mem-per-cpu=. Any suggestions for the best request settings here to make the process run to the end?
Hmmm. It looks like you’re sequencing the V4 region, right? You have a ton of uniques. Maybe you could go to taxlevel=5 or taxlevel=6? What diffs are you using for pre.cluster? Perhaps you could try 3.
Regardless, I would request as much RAM and time on the cluster as you can get.
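To make the two suggestions above concrete, here is a minimal sketch of a job script. It assumes the cluster uses Slurm (where the task option is spelled `--ntasks`), that mothur is multithreaded on a single node, and that the input files follow the MiSeq SOP's `final.*` naming. `--cpus-per-task` is an extra option not mentioned in the question, and every resource value is a placeholder to adapt to your site's limits.

```shell
#!/bin/bash
#SBATCH --time=48:00:00      # placeholder wall time
#SBATCH --nodes=1            # assumes a single-node, multithreaded mothur run
#SBATCH --ntasks=1           # one mothur process
#SBATCH --cpus-per-task=8    # placeholder; keep in step with processors= below
#SBATCH --mem-per-cpu=16G    # placeholder; total memory = 8 x 16G = 128G
set -eu

# Write a mothur batch file. The final.* file names are assumed (MiSeq SOP naming).
# taxlevel=5 splits the data into smaller groups than the SOP's taxlevel=4.
cat > cluster_split.batch <<'EOF'
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.03, processors=8)
EOF

# Run mothur only if it is available (e.g. after loading your site's mothur module).
if command -v mothur >/dev/null 2>&1; then
    mothur cluster_split.batch
else
    echo "mothur not found on PATH; wrote cluster_split.batch only" >&2
fi
```

Since memory is requested per CPU here, raising `--cpus-per-task` also raises the total memory request, so it is worth checking the product against what a single node can actually offer.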
Why not use opticlust? Before that came out, I had to drop to taxlevel=5 for large soil datasets, but even that doesn’t help if a huge number of the sequences are all in the same group (e.g. unclassified Proteobacteria). I’ve clustered ~800k uniques from soils with opticlust and 256 GB of RAM.
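The exact commands behind that run are not given in this thread. As a rough sketch of what clustering with opticlust without splitting can look like, the script below writes a two-line mothur batch file. The `final.*` file names are assumed from the MiSeq SOP, and `method=opti` is spelled out explicitly rather than relying on the default.

```shell
#!/bin/bash
set -eu

# Write a mothur batch file that computes pairwise distances up to the cutoff
# and then clusters them with opticlust. File names are assumptions.
cat > opticlust.batch <<'EOF'
dist.seqs(fasta=final.fasta, cutoff=0.03)
cluster(column=final.dist, count=final.count_table, method=opti, cutoff=0.03)
EOF

# Show what was written; on the cluster this would be run as: mothur opticlust.batch
cat opticlust.batch
```

With this route the whole distance file is handled in one piece, which is why the amount of RAM available matters so much.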
You’re right, it’s the V4 region. I’m also puzzled by the number of uniques. It’s all the same sample material (mucosal biopsies), processed in the same way, but sequenced in three batches.
I’ve been following the MiSeq SOP, so I’ve used diffs=2 for pre.cluster. I’ll try increasing it as you suggested and also try different taxlevel values.
Thank you for your advice!
Hi,
As I understand it, opticlust is what cluster.split uses in mothur/v.1.41.1. Or maybe I have misunderstood something… Could you please share the commands you’re using for opticlust?
Good news! I changed diffs to 3 in the pre.cluster step, the number of unique seqs was halved, and I was able to cluster and get a shared file.
Thanks for all the help!
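For anyone landing here later, the change amounts to a single parameter in the pre.cluster call. A minimal sketch, assuming the file names used at that step in the MiSeq SOP (the actual file names from this analysis are not shown in the thread):

```shell
#!/bin/bash
set -eu

# Write a one-line mothur batch file with diffs raised from the SOP's 2 to 3.
# The stability.* file names are assumed from the MiSeq SOP.
cat > precluster.batch <<'EOF'
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=3)
EOF

# Show what was written; run it with: mothur precluster.batch
cat precluster.batch
```

The steps that follow pre.cluster in the SOP (chimera removal, classification, and so on) then need to be rerun on the new output before clustering.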