pre.cluster taking a long time

I’m running the MiSeq SOP on mothur version 1.39.5. I’m running multiple processors as there are >1000 read files in my analysis. I’ve run the following commands (up to the pre.cluster command in the SOP) without hassle:

make.file(inputdir=~/endv_mproc, type=fastq, prefix=stability)
make.contigs(file=stability.files, processors=64)
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, minlength=371, maxambig=0, maxlength=420)
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4v5.fasta)
summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=3986, end=17783, maxhomop=8)
filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=., processors=64)
unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)

Once it reaches pre.cluster everything slows right down (it took around 50 hours to complete). I’ve checked my CPU usage and it seems to be using multiple processors before the pre.cluster command, then dropping back to a single processor for pre.cluster. My log file shows:

mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)

Using 64 processors.

Processing group endv:
2271040 595941 1675099
Total number of sequences before pre.cluster was 2271040.
pre.cluster removed 1675099 sequences.

It took 189282 secs to cluster 2271040 sequences.
It took 189480 secs to run pre.cluster.

I’m not sure if there may be a problem with my hardware not using processors correctly for this command, or whether I’ve input something incorrectly, or if it’s normal for the command to take this long with loads of sequence files?

Many thanks in advance for any advice!


So the issue discussed in this blog post may explain why your preclustering is taking so long.


Thanks, Richard. I’m getting a 404 error for the link - could you post the URL?
Many thanks!

Hi again, Richard. I managed to find the blog you mentioned and agree there’s lots of good info in there.

My question is more about the drop in CPU processing that I’m experiencing though. I’ve requested pre.cluster to run on 64 processors. While the mothur log file says that 64 processors are being used, the CPU usage stats look like only a single processor is being activated. Given how long the command is taking to run, I’m wondering if there is something wrong in my code, or something else I should be doing, to get more than one processor working on this command.

Are you able to advise around this issue?

Many thanks in advance!

Processing group endv:
2271040 595941 1675099
Total number of sequences before pre.cluster was 2271040.
pre.cluster removed 1675099 sequences.

It took 189282 secs to cluster 2271040 sequences.
It took 189480 secs to run pre.cluster.

This tells me that you only have one group - endv. The parallelization of pre.cluster involves putting each sample onto a different processor. If you only have one sample, no matter how many processors you request, it will only use one.
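
To make that concrete, here is a minimal Python sketch of the idea. It is only an illustration of per-group parallelism, not mothur's actual scheduling code; the function name `assign_groups` and the round-robin dealing are assumptions made for the example.

```python
def assign_groups(groups, processors):
    """Deal groups out round-robin to workers (illustrative only, not mothur's code).

    No more workers are created than there are groups, because a worker
    with no group assigned would have nothing to do.
    """
    workers = [[] for _ in range(min(processors, len(groups)))]
    for i, group in enumerate(groups):
        workers[i % len(workers)].append(group)
    return workers


# One group: only one worker ever gets work, whatever was requested.
print(len(assign_groups(["endv"], 64)))  # prints 1

# Hypothetical sample names: with 700 groups, all 64 workers are busy.
print(len(assign_groups([f"sample{i}" for i in range(700)], 64)))  # prints 64
```

The point is simply that the number of busy processors is capped by the number of groups, so a run with a single group behaves like a single-processor run.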


Many thanks, Pat.

My project name is endv (name of the analysis directory) and it includes read files from around 700 samples. These are spitting out nicely in my shared files etc so I’m not sure whether there is something I need to do so that pre.cluster recognises the multiple groups?

I tried adding the group file to pre.cluster, but it gave an error when I tried to use it together with the count option. I really like using the count option, so defaulted back to that without including the group file (assuming the program would grab it if needed), but it’s consistently dropping back to one processor. I’m not sure what else I should try.

On the good side, I’ve left mothur running and it’s pushed through the subsequent commands happily. I’m just hoping to find out what I can do to get pre.cluster running on multiple processors and reduce wall time on future runs.

Revision to the above: the post-pre.cluster files in my trial run (just 20 read files) worked fine, but yep, the version being output at the moment lists a single group called “endv”. So, I’m not sure if I’ve broken something in the earlier commands that is preventing the groups being recognised correctly?
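
A quick way to check how many groups a count file holds, without scrolling through it by hand, might look like the sketch below. It assumes the usual tab-delimited count_table layout, where the header row is `Representative_Sequence`, `total`, and then one column per group. The function name `groups_in_count_table` is made up for this example, and skipping lines that start with `#` is just a precaution.

```python
def groups_in_count_table(lines):
    """Return the group names found in the header row of a count_table.

    Assumes a tab-delimited header: sequence name, total, then one
    column per group. Blank lines and '#' comment lines are skipped.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        # The first two columns are the sequence name and the total count.
        return columns[2:]
    return []


# A header like the one this run would have produced: a single group.
header = "Representative_Sequence\ttotal\tendv\n"
print(groups_in_count_table([header]))  # prints ['endv']
```

To run it on a real file, pass an open file handle instead of the list, e.g. one opened on stability.trim.contigs.good.unique.good.filter.count_table. If the list that comes back has one entry, pre.cluster has only one group to work with.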

So, from the logfile, it looks like the error is occurring back with make.contigs. The output (below) suggests make.contigs is finding the read file pairs ok, but is outputting the data as a single group. I’m not sure how to revise make.contigs to prevent this happening.

input command:
mothur > make.contigs(file=stability.files, processors=32)

Example of read pairs being found:

Processing file pair /home/weebeige/endv_mproc/UWGDC2_S169_L001_R1_001.fastq - /home/weebeige/endv_mproc/UWGDC2_S169_L001_R2_001.fastq (files 595 of 598) <<<<<
Making contigs…

It took 4 secs to assemble 22603 reads.

outputs listed in log file:
Group count:
endv 17067779

Total of all groups is 17067779

Output File Names:


Thanks for pointing out it was a single group, Pat. It turned out that make.contigs was writing all the samples to a single group and, yep, this caused some issues. The source of the make.contigs error turned out to be an underscore in the name of my analysis directory (damn you, underscore!). Once that was removed, the make.contigs error went away, the groups were written correctly, and pre.cluster ran on multiple processors (completing in ~25 mins).
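
For anyone wanting to catch this before starting a long run, a small pre-flight check along these lines may help. It is only a sketch: the helper name `underscored_dirs` is made up for the example, and it does nothing more than flag directory names containing an underscore, which is the condition that caused trouble in this run with 1.39.5. It makes no claim about how make.contigs derives group names internally.

```python
from pathlib import PurePosixPath


def underscored_dirs(input_dir):
    """Return the components of `input_dir` that contain an underscore."""
    return [part for part in PurePosixPath(input_dir).parts if "_" in part]


# The analysis directory from this thread is flagged.
print(underscored_dirs("/home/weebeige/endv_mproc"))  # prints ['endv_mproc']

# A hypothetical renamed directory without the underscore passes cleanly.
print(underscored_dirs("/home/weebeige/endvmproc"))  # prints []
```

An empty list means no directory in the path contains an underscore. It is still worth confirming the group count in the make.contigs log afterwards.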

Many thanks to all who replied to help with sorting this out!