I’m trying to cluster my data, but I end up with a lot of very large files; the program stops and the job gets killed on the computing cluster for exceeding its memory allocation.
This is what I’m trying to cluster:
mothur > summary.seqs(fasta=TotIBDCHAR2.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=TotIBDCHAR2.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)
Using 32 processors.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       605     223     0       3       1
2.5%-tile:      1       609     252     0       3       1134795
25%-tile:       1       609     253     0       4       11347941
Median:         1       609     253     0       4       22695882
75%-tile:       1       609     253     0       5       34043823
97.5%-tile:     1       609     253     0       6       44256969
Maximum:        2       609     275     0       8       45391763
Mean:           1       608     252     0       4
# of unique seqs: 422582
total # of seqs: 45391763
It took 19 secs to summarize 45391763 sequences.
I have tried cluster.split following the MiSeq SOP, with the same settings: splitmethod=classify, taxlevel=4, cutoff=0.03 (mothur v.1.41.1).
Clustering with phylotype works, but as I understand it this is not the optimal solution. What else can I do?
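In case it helps others reading along, here is what I have been experimenting with to reduce the memory footprint. This is a sketch based on my reading of the cluster.split documentation, not advice from the developers: splitting at a finer taxlevel makes each sub-cluster smaller, fewer processors means fewer simultaneous copies of the data in memory, and cluster=f writes the split distance files to disk first so the actual clustering can be run as a separate, smaller job (the file names below are placeholders for my own):

```
mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.03, cluster=f, processors=4)
mothur > cluster.split(file=current, processors=4)
```

Would something along these lines be the recommended route, or is there a better option for a dataset of this size?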
I’m running my analysis on a computing cluster at the university where I can request working time. I can set --time=, --nodes=, --ntasks= and --mem-per-cpu=. Any suggestions for request settings that would let the process run to completion?
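For reference, this is roughly what my Slurm submission script looks like at the moment; the time limit, memory values, module name, and file names are guesses/placeholders specific to my setup, not recommendations:

```shell
#!/bin/bash
#SBATCH --time=96:00:00        # wall-clock limit (placeholder)
#SBATCH --nodes=1              # mothur runs on a single node
#SBATCH --ntasks=8             # should match processors= in mothur
#SBATCH --mem-per-cpu=16G      # total memory = ntasks x mem-per-cpu

# module name below is specific to my cluster
module load mothur

mothur "#cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03, processors=8)"
```

My understanding is that since mothur is multithreaded but not MPI-parallel, requesting more than one node wouldn’t help, and that asking for fewer tasks with a higher --mem-per-cpu may be the way to keep total memory within limits. Corrections welcome.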