Clustering a large dataset

toneta · January 24, 2019, 10:31am

I’m trying to cluster my data, but I end up with a lot of very large files, and then the program stops and my job is thrown off the computing cluster for using too much memory.
What I am trying to cluster is the following:

mothur > summary.seqs(fasta=TotIBDCHAR2.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=TotIBDCHAR2.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)

Using 32 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	605	223	0	3	1
2.5%-tile:	1	609	252	0	3	1134795
25%-tile:	1	609	253	0	4	11347941
Median: 	1	609	253	0	4	22695882
75%-tile:	1	609	253	0	5	34043823
97.5%-tile:	1	609	253	0	6	44256969
Maximum:	2	609	275	0	8	45391763
Mean:	1	608	252	0	4
# of unique seqs:	422582
total # of seqs:	45391763

It took 19 secs to summarize 45391763 sequences.

I have tried cluster.split according to the MiSeq SOP, with the same settings: splitmethod=classify, taxlevel=4, cutoff=0.03 (mothur/v.1.41.1).
Clustering with phylotypes works, but as I understand it, this is not the optimal solution. What else can I do?
I’m running my analysis on a computer cluster at the university where I can request working time. I can set --time=, --nodes=, --tasks= and --mem-per-cpu=. Any suggestions for the best request settings here to make the process run to the end?

pschloss · January 24, 2019, 7:15pm

Hmmm. It looks like you’re sequencing the V4 region, right? You have a ton of uniques. Maybe you could go to taxlevel=5 or taxlevel=6? What diffs are you using for pre.cluster? Perhaps you could try 3.

Regardless, I would request as much RAM and time on the cluster as you can get.

Pat
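
Editor's sketch: the two suggestions above each come down to changing one parameter, diffs in pre.cluster and taxlevel in cluster.split. The Python snippet below only assembles the corresponding mothur command strings so they can be pasted into a batch file; it does not run mothur. The helper names and the file names are hypothetical placeholders, and the parameter values are the ones mentioned in this thread. In the MiSeq SOP these two commands are not consecutive (chimera removal and classification come in between), so treat this as an illustration of the parameters, not a complete workflow.

```python
def pre_cluster_cmd(fasta, count, diffs=3):
    """Build a mothur pre.cluster command string (diffs=3 as suggested above)."""
    return f"pre.cluster(fasta={fasta}, count={count}, diffs={diffs})"


def cluster_split_cmd(fasta, count, taxonomy, taxlevel=5, cutoff=0.03):
    """Build a mothur cluster.split command string that splits by classification."""
    return (
        f"cluster.split(fasta={fasta}, count={count}, taxonomy={taxonomy}, "
        f"splitmethod=classify, taxlevel={taxlevel}, cutoff={cutoff})"
    )


if __name__ == "__main__":
    # Hypothetical placeholder file names; substitute the files from your own run.
    print(pre_cluster_cmd("input.fasta", "input.count_table"))
    print(cluster_split_cmd("final.fasta", "final.count_table", "final.taxonomy"))
```

Raising taxlevel from 4 to 5 or 6 splits the data into more, smaller groups, so each group needs less memory to cluster.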

Kendra · January 24, 2019, 7:55pm

Why not use opticlust? Before that came out, I’d have to drop to taxlevel=5 for large soil datasets, but even that doesn’t help if a huge number of the sequences all fall in the same group (e.g. unclassified Proteobacteria). I’ve clustered ~800k uniques from soils with opticlust and 256 GB of RAM.
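
Editor's sketch: to turn a total memory figure like the 256 GB above into a per-CPU request, divide by the number of CPUs. This assumes a Slurm-style scheduler, where the job's memory is the --mem-per-cpu value multiplied by the number of allocated CPUs, and it assumes the 32 processors shown in the summary.seqs output earlier in the thread. Whether 256 GB is enough for this particular dataset is not known; the function name is hypothetical.

```python
def mem_per_cpu_gb(total_gb, n_cpus):
    """Per-CPU memory request (GB) that adds up to total_gb across n_cpus."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return total_gb / n_cpus


if __name__ == "__main__":
    # 256 GB in total spread over 32 CPUs
    print(mem_per_cpu_gb(256, 32))  # prints 8.0
```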

toneta · January 25, 2019, 8:01am

You’re right, it’s the V4 region. I’m also puzzled by the number of uniques. It’s all the same sample material (mucosal biopsies), processed in the same way, but sequenced in three batches.
I’ve been following the MiSeq SOP, so I’ve used diffs=2 for pre.cluster. I’ll try increasing it as you suggested and also try different taxlevels.
Thank you for your advice!

Best,
Tone

toneta · January 25, 2019, 9:00am

Hi,
As I understand it, opticlust is what mothur/v.1.41.1 uses in cluster.split. Or maybe I have misunderstood something… Could you please share the commands you’re using for opticlust?

Best,
Tone

toneta · January 29, 2019, 1:17pm

Good news! I changed diffs to 3 in the pre.cluster step, the number of unique seqs was halved, and I was able to cluster and get a shared file.
Thanks for all help!

Tone
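
Editor's sketch: a rough way to see why halving the uniques made such a difference is to count the possible sequence pairs, which grows with the square of the number of uniques. The snippet below uses the 422,582 uniques reported by summary.seqs at the top of the thread (about 89 billion pairs) and assumes "halved" means exactly half, since the exact count after pre.cluster was not posted. This is an upper bound only: cluster.split compares sequences within each taxonomic split, and only distances under the cutoff are kept, so the real numbers are smaller.

```python
def n_pairs(n_uniques):
    """Number of distinct pairs among n_uniques sequences: n * (n - 1) / 2."""
    return n_uniques * (n_uniques - 1) // 2


if __name__ == "__main__":
    before = 422_582      # unique seqs reported by summary.seqs in the first post
    after = before // 2   # assumption: "halved" taken literally

    print(f"pairs before: {n_pairs(before):,}")
    print(f"pairs after:  {n_pairs(after):,}")
    print(f"reduction:    {n_pairs(before) / n_pairs(after):.2f}x")
```

Halving the number of uniques cuts the worst-case number of pairwise comparisons roughly fourfold, which is consistent with the job now fitting in memory.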

