Pairwise.seqs taking long with ITS

Good day
I have a frustrating little issue with the pairwise.seqs command. The command is just taking extremely long to run. Normally it doesn’t run longer than two days on my local machine, but this run had already taken several days there, so I submitted it as a job on my university’s cluster. That job ran out of its one-week time limit last night. On Friday I submitted another job and set the processors to 46. That job is still running.
This is for an ITS dataset and the command that I’m using is:
mothur > pairwise.seqs(fasta=final.fasta, cutoff=0.05)
and for my job submission:
pairwise.seqs(fasta=final.fasta, cutoff=0.05, processors=48)

If I go and check in the folder that the job is running in, I see that there is already a final.dist file, but I’m assuming this file is not ready to work with until the job has finished running?

This problem is seriously affecting the timeline for my workflow, and I would appreciate any insight on the matter.

Best
Nicolas

Hi,

I’m sorry it’s taking so long. How many sequences do you have to analyze? I’m not sure that we can speed it up any for you, but suspect you have a lot more sequences than you typically do.

Thanks,
Pat

Thanks for getting back to me on this, Pat. My final fasta file has 10370208 sequences. Yes, I think this is a bit more than I normally have.

Best
Nicolas

Sorry - with over 10 million unique sequences, I’m afraid it’s going to be slow going.
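
For a sense of scale: an all-against-all comparison of n sequences involves n(n-1)/2 pairs, so the work grows roughly with the square of the number of sequences. Below is a quick back-of-the-envelope sketch in Python (not mothur code). The 1,000,000-sequence figure is a made-up smaller dataset used only for comparison, and the sketch counts pairs only; it says nothing about how long each alignment takes.

```python
def pair_count(n_seqs: int) -> int:
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n_seqs * (n_seqs - 1) // 2


n_reported = 10_370_208  # sequence count reported in this thread
n_smaller = 1_000_000    # hypothetical smaller dataset, for scale only

pairs_reported = pair_count(n_reported)
pairs_smaller = pair_count(n_smaller)

print(f"{pairs_reported:,} pairs for {n_reported:,} sequences")
print(f"ratio to the {n_smaller:,}-sequence case: {pairs_reported / pairs_smaller:.1f}x")
```

For the count in this thread that comes to roughly 5.4 x 10^13 pairs, which is why adding processors only helps so much.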

Pat

And please note that your distance file is going to be huge - likely several TB.
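
A rough way to reason about the file size, sketched in Python. Both inputs are assumptions, not measurements: `fraction_kept` is the share of pairs that fall under the cutoff and get written, and `bytes_per_record` is the average length of one text line holding two sequence names and a distance. Neither value is known for this dataset, so the numbers below are placeholders to plug your own estimates into.

```python
def estimated_dist_size_tb(n_seqs: int, fraction_kept: float, bytes_per_record: float) -> float:
    """Estimate the size of a sparse distance file in terabytes (10^12 bytes).

    fraction_kept and bytes_per_record are user-supplied guesses, not values
    reported by mothur.
    """
    pairs = n_seqs * (n_seqs - 1) // 2
    return pairs * fraction_kept * bytes_per_record / 1e12


# Placeholder guesses: 0.1% of pairs under the cutoff, 40 bytes per line.
size_tb = estimated_dist_size_tb(10_370_208, fraction_kept=0.001, bytes_per_record=40)
print(f"estimated size: {size_tb:.2f} TB")
```

Because the pair count is so large, even a small fraction of pairs under the cutoff adds up to terabytes, so it is worth checking the free disk space in the job's working directory.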

Thanks for the responses, Pat and Leocadio.
That is good to know. Would you say that those numbers look a bit unusual for 300 soil samples? This is just for ITS.

Without any denoising, I don’t think that would be that strange.

Thanks for the feedback, everyone