Optimal resource request from computing cluster

Hi there! We submit mothur jobs to our computing cluster with PBS, where we specify the CPU, memory, and wall-time resources for each job.

I only know anecdotally the resources needed to process jobs with a certain number of samples. From the mothur development side, can you advise on what might be optimal for mothur as a program? I am starting to have some trouble with 80 MiSeq samples, which in my limited experience should be well within our computing power. I don’t think this is always a more-is-better situation, but this is well outside my expertise. Thanks for the advice!
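For context, the header of a typical submission script on our end looks roughly like this (the job name, resource values, and batch-file name are illustrative placeholders, not tuned recommendations):

#!/bin/bash
# Hypothetical PBS/Torque job header -- all values are placeholders.
#PBS -N mothur_miseq            # job name
#PBS -l nodes=1:ppn=8           # 1 node, 8 cores
#PBS -l mem=32gb                # total memory for the job
#PBS -l walltime=24:00:00       # wall-clock limit

cd $PBS_O_WORKDIR               # run from the submission directory
mothur stability.batch          # batch file listing the mothur commands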

Cheers,
Alex

Hi Alex,

It’s a little hard to answer your question without more specifics, which is why we don’t really have any posted guidelines. Things like the read length, type of data (e.g., mouse, human, soil, marine), region (e.g., V4, V3-V4), number of sequences per sample, and number of samples all factor in. Of course, we still strongly recommend using the 2x250 V2 chemistry to sequence the V4 region. For ~500 human samples, we generally use a node with 48 GB of RAM. I’m not sure if that’s helpful or not. Sometimes the need is more RAM; sometimes it’s less RAM and more processors.

Let me know if you have any more details about your project and job specs.

Pat

Thanks, Pat. I dug into this a little more, and it looks like the dist.seqs() command is taking ~90% of the total job run time. Independent of project and job specs, are there suggestions for making dist.seqs() run more efficiently? I am currently running it as:

dist.seqs(fasta=current, cutoff=0.03)

with the current fasta file coming from remove.lineage → rf2106.paired.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta
Can I increase my cutoff here to something greater and still make OTUs at 0.03 later?
Thanks,
Alex

Thanks for following up. Are you using multiple processors? You might also try skipping dist.seqs and using cluster.split instead, which only calculates distances between sequences within the same taxonomic group.
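For example, a sketch of the two options (the processor count and taxlevel here are illustrative, and the count and taxonomy files would be the ones produced by your earlier steps):

dist.seqs(fasta=current, cutoff=0.03, processors=8)

cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=4, cutoff=0.03, processors=8)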

Finally, make sure that you’ve read this: Why do I have such a large distance matrix?