Pre.cluster taking super long

Good day.
I am working through an ITS workflow and am stuck at the pre.cluster step. I skipped the trimming step because the staff at our university’s sequencing institute had already trimmed the reads. The workflow that I’m following calls for a group file produced by the trimming step as an input for pre.cluster. Because I skipped that step, I made a group file manually with make.group. I used that as my input for pre.cluster, but it gave me too many errors, so I opted for the name file instead. It has now been running for over ten days on our university’s cluster (I tried it on my laptop first). Surely something’s not right here? Is there a way to make it run faster without a group file?
Otherwise, I think it might have something to do with the group file that I made manually, because the count.seqs command doesn’t work with my group file, only with my name file. However, when I run the count.groups command, all my groups appear with the number of sequences in each.
I’m really stuck here, so I would greatly appreciate anyone’s help with this.
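
For readers who hit the same count.seqs failure, one thing worth ruling out is a mismatch between the read names in the name file and those in the group file. The sketch below is only an illustration, not part of mothur: it assumes a tab-separated name file (representative read, then a comma-separated list of the reads it stands for) and a tab-separated group file (read name, then group). The function names and the example file names are hypothetical.

```python
def names_in_name_file(lines):
    """Collect every read name listed in a two-column name file."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        representative, members = line.split("\t")
        names.add(representative)
        names.update(members.split(","))
    return names


def names_in_group_file(lines):
    """Collect every read name listed in a two-column group file."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        read_name, _group = line.split("\t")
        names.add(read_name)
    return names


def compare_name_and_group(name_lines, group_lines):
    """Return (reads missing from the group file, reads only in the group file)."""
    named = names_in_name_file(name_lines)
    grouped = names_in_group_file(group_lines)
    return sorted(named - grouped), sorted(grouped - named)


# With real files (the file names here are hypothetical):
# with open("soil.names") as n, open("soil.groups") as g:
#     missing, extra = compare_name_and_group(n, g)
```

If either returned list is non-empty, the two files do not describe the same set of reads, which would be a reasonable first thing to fix before rerunning.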

It’s possible the long run time is real. ITS sequences are likely to have a high sequencing error rate because the reads are unlikely to fully overlap. You don’t necessarily need a group file, but you should double-check that you’ve run unique.seqs to reduce the complexity of the dataset. If you can give more details about what you’ve done (e.g. number of reads, sequencing platform, type of samples), it might be possible to help you further.
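
To illustrate why reducing the data to unique sequences matters, here is a minimal sketch of the idea behind dereplication. It is not mothur's implementation; `dereplicate` and `pairwise_comparisons` are hypothetical helpers, and the all-against-all count is only a rough worst-case proxy for the work a pairwise step such as pre.cluster has to do.

```python
def dereplicate(records):
    """Group read names by identical sequence.

    records: iterable of (read_name, sequence) pairs.
    Returns a dict mapping each distinct sequence to the reads that carry it.
    """
    unique = {}
    for read_name, sequence in records:
        unique.setdefault(sequence, []).append(read_name)
    return unique


def pairwise_comparisons(n):
    """Number of distinct pairs among n sequences (worst-case comparisons)."""
    return n * (n - 1) // 2


reads = [
    ("read1", "ACGTTGCA"),
    ("read2", "ACGTTGCA"),
    ("read3", "ACGTTGCT"),
    ("read4", "ACGTTGCA"),
]
unique = dereplicate(reads)
print(len(reads), "reads collapse to", len(unique), "unique sequences")
print("pairs before:", pairwise_comparisons(len(reads)))
print("pairs after: ", pairwise_comparisons(len(unique)))
```

The same reasoning suggests why a high error rate hurts: each erroneous read tends to become its own unique sequence, so the number of pairs to compare grows quickly.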


Dear Pat
Many thanks for your response. I have already completed the unique.seqs step. I am attaching the summary.seqs output for that here (attached image: summary.seqs output). The sequencing was done on an Illumina platform, and the samples are from soils underneath plants treated with herbicides and heat.
I think it is important to note that when I received the samples on BaseSpace, they were all in separate folders (I am working with 96 samples). I then copied them all into the same working directory and ran make.contigs for each sample individually. I then merged all the make.contigs outputs with merge.files.
I hope that helps.

I fear the problem is the high error rate associated with read pairs that don’t fully overlap. I’m not sure it would matter, but you should be able to put all of your fastq / fastq.gz files into the same directory and then use make.file to generate the input to make.contigs.
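
As a rough picture of what that input looks like, the sketch below pairs forward and reverse files from one directory listing into three tab-separated columns (sample, forward file, reverse file), so every sample goes through make.contigs in a single run. It is an illustration only, not mothur code: `pair_fastq` is a hypothetical helper, and the regular expression assumes the usual Illumina naming (`Sample_S1_L001_R1_001.fastq.gz`), which may not match your files.

```python
import re

# Assumed Illumina-style file names, e.g. soilA_S1_L001_R1_001.fastq.gz
PAIR_PATTERN = re.compile(
    r"^(?P<sample>.+?)_S\d+_L\d{3}_R(?P<read>[12])_001\.fastq(\.gz)?$"
)


def pair_fastq(filenames):
    """Return sorted (sample, forward_file, reverse_file) tuples."""
    pairs = {}
    for filename in filenames:
        match = PAIR_PATTERN.match(filename)
        if match is None:
            continue  # skip files that do not follow the assumed naming
        pairs.setdefault(match["sample"], {})[match["read"]] = filename
    rows = []
    for sample in sorted(pairs):
        reads = pairs[sample]
        if set(reads) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a forward or reverse file")
        rows.append((sample, reads["1"], reads["2"]))
    return rows


listing = [
    "soilB_S2_L001_R2_001.fastq.gz",
    "soilA_S1_L001_R1_001.fastq.gz",
    "soilA_S1_L001_R2_001.fastq.gz",
    "soilB_S2_L001_R1_001.fastq.gz",
    "notes.txt",
]
for row in pair_fastq(listing):
    print("\t".join(row))
```

Because the sample name travels with each read pair, a table like this lets the group assignments be produced alongside the contigs instead of being rebuilt by hand afterwards.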


Dear Pat.
Thanks for getting back to me again. I actually did exactly that about two days ago: I ran make.file on my folder and then made one big contigs file. It seems to be running much faster now.

