Pre.cluster taking super long

Good day.
I am working through an ITS workflow and am stuck at the pre.cluster step. I skipped the trimming step because the staff at our university's sequencing institute had already trimmed the reads. The workflow I'm following calls for a group file produced by the trimming step as an input to pre.cluster. Since I didn't run that step, I manually made a group file with make.group. I used it as my input for pre.cluster, but it threw a lot of errors, so I opted for the name file instead. It has now been running for over ten days on our university's cluster (I tried my laptop first). Surely something's not right here? Is there a way to make it run faster without a group file?
Otherwise, I suspect the problem is the group file I made manually, because count.seqs fails with my group file and only works with my name file. Oddly, though, count.groups lists all my groups with the number of sequences in each.
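In case it helps, this is roughly the route I was trying to take; all filenames and group names below are placeholders rather than my actual files:

# build a group file by assigning each sequence to its sample
mothur > make.group(fasta=sampleA.fasta-sampleB.fasta, groups=sampleA-sampleB)
# combine the name and group files into a single count table
mothur > count.seqs(name=soil.trim.names, group=soil.groups)
# denoise using the count table; diffs=2 follows the usual 1-diff-per-100-bp rule for ~250 bp reads
mothur > pre.cluster(fasta=soil.trim.unique.fasta, count=soil.trim.count_table, diffs=2)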
I'm really stuck here, so I would greatly appreciate any help with this.

It's possible the long runtime is real. ITS sequences tend to have a high sequencing error rate because the read pairs are unlikely to fully overlap. You don't necessarily need a group file, but you should double-check that you've run unique.seqs to reduce the complexity of the dataset. If you can give more details about what you've done (e.g., number of reads, sequencing platform, type of samples), it might be possible to help you further.
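Something along these lines, with your own filenames in place of the placeholders:

# collapse identical sequences; writes a .unique.fasta and a .names file
mothur > unique.seqs(fasta=soil.trim.fasta)
# check how far the number of sequences actually dropped
mothur > summary.seqs(fasta=soil.trim.unique.fasta, name=soil.trim.names)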

Pat

Dear Pat,
Many thanks for your response. I have already run the unique.seqs step; I'm attaching a screenshot of its summary.seqs output here (Image 7-27-20 at 2.16 PM). The sequencing was done on an Illumina platform, and the samples come from soils underneath plants treated with herbicides and heat.
I think it is important to note that when I downloaded the samples from BaseSpace, they were all in separate folders (I am working with 96 samples). I copied them all into the same working directory, ran make.contigs for each sample individually, and then merged all the make.contigs outputs with merge.files, roughly as sketched below.
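Per sample, that looked roughly like the following (the sample and output names are placeholders):

# assemble the read pairs for one sample at a time
mothur > make.contigs(ffastq=sample01_R1.fastq, rfastq=sample01_R2.fastq)
# ...repeated for each of the 96 samples, then everything merged into one fasta
mothur > merge.files(input=sample01.contigs.fasta-sample02.contigs.fasta, output=all.contigs.fasta)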
I hope that helps
Thanks!

I fear the problem is the high error rate associated with read pairs that don't fully overlap. I'm not sure it will change anything, but you should be able to put all of your fastq / fastq.gz files into the same directory and then use make.file to generate the input for make.contigs.
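Something like this, assuming gzipped fastqs and an arbitrary prefix:

# scan the directory for paired reads and write a sample-to-files mapping
mothur > make.file(inputdir=., type=gz, prefix=soil)
# assemble every sample in one pass from that mapping file
mothur > make.contigs(file=soil.files, processors=8)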

Pat

Dear Pat,
Thanks for getting back to me again. I actually did exactly that about two days ago: I ran make.file on my folder and built one big set of contigs. It seems to be running much faster now.

Cheers
