How to reduce the time in pairwise.seq

Hzsun · June 18, 2021, 12:18am

I am running the pairwise.seqs command on fungal sequences and would like your advice. The problem is that the dataset is too large and the command takes too long, even though I used 40 processors. The job timed out after 10 days. The summary of the dataset is:

             Start  End  NBases  Ambigs  Polymer  NumSeqs
Minimum:     1      230  230     0       3        1
2.5%-tile:   1      230  230     0       3        650365
25%-tile:    1      230  230     0       4        6503644
Median:      1      230  230     0       5        13007288
75%-tile:    1      230  230     0       5        19510931
97.5%-tile:  1      230  230     0       7        25364210
Maximum:     1      230  230     0       8        26014574
Mean:        1      230  230     0       4

# of unique seqs: 489119

total # of seqs: 26014574

My question is: is there any way to solve this problem?

Many thanks,

Hui

sje062 · June 18, 2021, 6:56am

Hi Hzsun,
I have used the dist.seqs command myself. I guess your goal is similar: to get a distance matrix. I may be wrong, but my suggestion is to reduce the workload. Run the command on the unique sequences only, set cutoff to 0.10, and think about what information you need the distance matrix to store, then limit the workload to just that.

Sigmund
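
To illustrate why working on unique sequences reduces the workload, here is a minimal Python sketch that collapses identical reads into unique sequences with counts. It is only an illustration of the idea, not mothur's implementation; the function name `collapse_duplicates` and the toy reads are made up for the example.

```python
from collections import Counter

def collapse_duplicates(reads):
    """Map each distinct sequence string to the number of reads carrying it."""
    return Counter(reads)

# Toy reads (hypothetical data, not taken from the dataset in this thread).
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA"]
unique_counts = collapse_duplicates(reads)

# Distances only need to be computed between the distinct sequences;
# the counts record how many reads each one stands for.
print(len(reads), "reads ->", len(unique_counts), "unique sequences")
for seq, count in unique_counts.most_common():
    print(seq, count)
```

In the dataset above the same idea takes 26014574 reads down to 489119 unique sequences, so only the unique ones need to be compared.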

Hzsun · June 18, 2021, 8:03am

Hei, Sigmund,
Thanks for the suggestion. I guess dist.seqs should work as well. I also set cutoff=0.10 in the pairwise.seqs command, which was run on the unique sequences. I just want to confirm whether dist.seqs would be faster than pairwise.seqs on the same dataset.
Thanks

Hui

pschloss · June 18, 2021, 6:55pm

If the sequences are ITS and they are not aligned, then dist.seqs will not work. You’ll have to use pairwise.seqs. Sorry it is slow - we know and are working on some solutions, but they won’t be ready for a while yet. I’m afraid that with 490k sequences, any distance matrix you generate would be gigantic. How did you get all of your ITS sequences to be 230 nt?

Pat
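
For a sense of scale, the number of pairwise comparisons among n sequences is n(n - 1)/2. The Python sketch below works that out for the 489119 unique sequences reported above. The per-distance storage figure is an assumption chosen only for illustration; the real size depends on the output format and the sequence names.

```python
def pair_count(n_seqs: int) -> int:
    """Number of unordered pairs among n_seqs sequences: n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2

n_unique = 489_119  # unique sequence count reported in the first post
pairs = pair_count(n_unique)

# ASSUMPTION: 30 bytes per stored distance is a made-up round figure for
# illustration; it is not a measured value for any mothur output format.
assumed_bytes_per_distance = 30
full_size_tb = pairs * assumed_bytes_per_distance / 1e12

print(f"{pairs:,} pairwise comparisons")
print(f"about {full_size_tb:.1f} TB if every distance were kept")
```

That is roughly 1.2 × 10^11 comparisons, which is why both the run time and the size of an unfiltered matrix become a problem at this scale.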

Hzsun · June 19, 2021, 4:31am

Hei, Pat,

Thanks for the information. We used the ITS1 region, and because of the large number of samples and the huge dataset, we trimmed all the sequences to 230 nt. This will surely affect the downstream analysis.

Is it possible to set cutoff=0.03 in the pairwise.seqs command? That is the only distance level we are interested in.
Thanks,

Hui

pschloss · June 21, 2021, 8:08pm

Hi Hui - you should be able to set cutoff=0.03

Pat
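
As a rough illustration of what a distance cutoff does to the output, the Python sketch below keeps only pairs whose distance is at or below the cutoff. It assumes the cutoff acts as a filter on which distances are stored. It is a toy: it uses a plain mismatch fraction on equal-length strings in place of the pairwise alignment that pairwise.seqs performs, and the helper names and sequences are hypothetical.

```python
from itertools import combinations

def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def substitute(seq: str, positions: set) -> str:
    """Return seq with the base at each given position swapped for another."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] if i in positions else b for i, b in enumerate(seq))

def distances_within_cutoff(seqs: dict, cutoff: float) -> list:
    """Return (name_a, name_b, distance) for pairs with distance <= cutoff."""
    kept = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        d = mismatch_fraction(seq_a, seq_b)
        if d <= cutoff:
            kept.append((name_a, name_b, d))
    return kept

# Hypothetical 100-nt toy sequences (not real ITS data).
base = "ACGT" * 25
seqs = {
    "s1": base,
    "s2": substitute(base, {10, 50}),               # 2 substitutions vs s1
    "s3": substitute(base, set(range(0, 100, 10))),  # 10 substitutions vs s1
}

kept = distances_within_cutoff(seqs, cutoff=0.03)
print(len(kept), "of 3 pairs kept:", kept)
```

The sketch does not model alignment, so it says nothing about run time; it only shows how a cutoff shrinks what has to be stored.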
