Clustering a large dataset


I’m trying to cluster my data, but I end up with a lot of very large files, and the program gets stopped and thrown off the computing cluster for using too much memory.
This is what I’m trying to cluster:

mothur > summary.seqs(fasta=TotIBDCHAR2.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=TotIBDCHAR2.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)

Using 32 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	605	223	0	3	1
2.5%-tile:	1	609	252	0	3	1134795
25%-tile:	1	609	253	0	4	11347941
Median: 	1	609	253	0	4	22695882
75%-tile:	1	609	253	0	5	34043823
97.5%-tile:	1	609	253	0	6	44256969
Maximum:	2	609	275	0	8	45391763
Mean:	1	608	252	0	4
# of unique seqs:	422582
total # of seqs:	45391763

It took 19 secs to summarize 45391763 sequences.

I have tried cluster.split according to the MiSeqSOP, with the same settings: splitmethod=classify, taxlevel=4, cutoff=0.03 (mothur/v.1.41.1).
Clustering with phylotypes works, but as I understand it, this is not the optimal solution. What else can I do?
I’m running my analysis on a computer cluster at the university where I can request working time. I can set --time=, --nodes=, --tasks= and --mem-per-cpu=. Any suggestions for the best request settings here to make the process run to the end?


Hmmm. It looks like you’re sequencing the V4 region, right? You have a ton of uniques. Maybe you could go to taxlevel=5 or taxlevel=6? What diffs are you using for pre.cluster? Perhaps you could try 3.

Regardless, I would request as much RAM and time on the cluster as you can get.
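
To put "a ton of uniques" into numbers, here is a rough back-of-the-envelope sketch in Python. It only counts how many unordered sequence pairs exist for the number of uniques reported by summary.seqs above. This is an upper bound, not what mothur actually stores: with cutoff=0.03 only distances below the cutoff are kept, and splitting by taxonomy means pairs in different groups are never compared. The function name `pair_count` is just an illustrative helper, not part of mothur.

```python
def pair_count(n_uniques: int) -> int:
    """Number of unordered sequence pairs: n * (n - 1) / 2."""
    return n_uniques * (n_uniques - 1) // 2


# "# of unique seqs" taken from the summary.seqs output above.
n_uniques = 422_582

# Upper bound on pairwise comparisons; mothur keeps far fewer distances
# once the cutoff and the taxonomic split are applied.
print(f"{pair_count(n_uniques):,} possible pairs")
```

For this dataset that comes out in the tens of billions of possible pairs, which is why anything that reduces the number of uniques helps so much.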



Why not use opticlust? Before that came out, I’d have to drop to taxlevel=5 for large soil datasets, but even that doesn’t help if a huge number of the sequences all fall into the same group (e.g. unclassified Proteobacteria). I’ve clustered ~800k uniques from soils with opticlust and 256 GB of RAM.


You’re right, it’s the V4 region. I’m also puzzled by the number of uniques. It’s all the same sample material (mucosal biopsies), processed in the same way, but sequenced in three batches.
I’ve been following the MiSeqSOP, so I’ve used diffs 2 for pre.cluster. I’ll try increasing as you suggested and also try different tax-levels.
Thank you for your advice!



As I understand it, opticlust is what mothur/v.1.41.1 uses in cluster.split. Or maybe I have misunderstood something… Could you please share the commands you’re using for opticlust?



Good news! I changed diffs to 3 in the pre.cluster step, the number of unique seqs was halved, and I was able to cluster and get a shared file.
Thanks for all the help!
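
A small Python sketch of why halving the uniques makes such a difference: the number of possible sequence pairs scales roughly with the square of the number of uniques. The exact count after diffs=3 is not given here, so `n_after` below is an assumption (simply half of the original 422582), and `pair_ratio` is a hypothetical helper name.

```python
def pair_ratio(n_before: int, n_after: int) -> float:
    """Ratio of possible unordered pairs after vs. before reducing uniques."""
    def pairs(n: int) -> int:
        return n * (n - 1) // 2
    return pairs(n_after) / pairs(n_before)


n_before = 422_582        # uniques before re-running pre.cluster (from summary.seqs)
n_after = n_before // 2   # ASSUMPTION: "halved"; the exact number was not reported

# Fraction of the original pairwise comparisons that remain possible.
print(round(pair_ratio(n_before, n_after), 3))
```

So half the uniques means only about a quarter of the possible pairwise comparisons, before the cutoff and taxonomic split reduce it further.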

Tone :smiley:
