Optimal resource request from computing cluster

Hi there! We submit mothur jobs to our computing cluster with PBS, where we specify the CPU, memory, and wall-time resources for each job.

I only know anecdotally the resources needed to process jobs with a certain number of samples. From the mothur development side, can you advise on what might be optimal for mothur as a program? I am starting to have some trouble with 80 MiSeq samples, which in my limited experience should be well within our computing power. I don’t think this is always a more-is-better situation, but this is well outside my expertise. Thanks for the advice!
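For context, the header of a typical submission script on our end looks roughly like this (the job name, resource values, and batch-file name are illustrative placeholders, not tuned recommendations):

#!/bin/bash
# Hypothetical PBS/Torque job header -- all values are placeholders.
#PBS -N mothur_miseq            # job name
#PBS -l nodes=1:ppn=8           # 1 node, 8 cores
#PBS -l mem=32gb                # total memory for the job
#PBS -l walltime=24:00:00       # wall-clock limit

cd $PBS_O_WORKDIR               # run from the submission directory
mothur stability.batch          # batch file listing the mothur commands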

Cheers,
Alex

Hi Alex,

It’s a little hard to answer your question without more specifics, which is why we don’t really have any posted guidelines. Things like the read length, type of data (e.g., mouse, human, soil, marine), region (e.g., V4, V3-V4), number of sequences per sample, and number of samples all factor in. Of course, we still strongly recommend using the 2x250 V2 chemistry to sequence the V4 region. For ~500 human samples, we generally use a node with 48 GB of RAM. I’m not sure if that’s helpful or not. Sometimes the need is more RAM; sometimes it’s less RAM and more processors.

Let me know if you have any more details about your project and job specs.

Pat

Thanks, Pat. I dug into this a little more, and it looks like the dist.seqs() command is taking ~90% of the total job run time. Independent of project and job specs, are there suggestions for making dist.seqs() run more efficiently? I am currently running it as:

dist.seqs(fasta=current, cutoff=0.03)

with the current fasta file coming from remove.lineage → rf2106.paired.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta
Can I increase my cutoff here to something greater and still make OTUs at 0.03 later?
Thanks,
Alex

Thanks for following up. Are you using multiple processors? You might also try skipping dist.seqs and using cluster.split instead, which only calculates distances between sequences within the same taxonomic group.
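For example, a sketch of the two options (the processor count and taxlevel here are illustrative, and the count and taxonomy files would be the ones produced by your earlier steps):

dist.seqs(fasta=current, cutoff=0.03, processors=8)

cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=4, cutoff=0.03, processors=8)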

Finally, make sure that you’ve read this: Why do I have such a large distance matrix?