Feedback on a pre.cluster workaround for processing ITS sequences

Greetings,

I am writing today to see if I can get your opinion on a workaround I am attempting for the pre.cluster step of my ITS analysis. To start, I am using an approach I found on the mothur forum which links out to GitHub.

I have made it to the pre.cluster step, but my university's HPC will only allow me 240 wall-time hours (28 processors @ 6 GB RAM each), i.e., 6720 CPU hours, per job request. Even using the maximum amount of time and processing power, I have not been able to run the pre.cluster command successfully.
After running count.seqs, summary.seqs reports the following (across 32 groups):
             Start  End    NBases  Ambigs  Polymer  NumSeqs
Minimum:     1      20     20      0       2        1
2.5%-tile:   1      269    269     0       4        83337
25%-tile:    1      270    270     0       5        833367
Median:      1      271    271     0       5        1666733
75%-tile:    1      271    271     0       6        2500099
97.5%-tile:  1      271    271     0       7        3250128
Maximum:     1      273    273     0       8        3333464
Mean:        1      270.5  270.5   0       5.41249

# of unique seqs: 987708

total # of seqs: 3333464

Though the numbers are not as large after the unique.seqs command, my groups have the following read counts:
Group count:
12.10D 22446
12.10F 215029
12.5D 32817
12.5F 155475
13.10D 20321
13.10F 73439
13.5D 16710
13.6F 34876
14.10D 18798
14.10F 29129
14.5.1D 14679
14.5.1F 54480
15.10D 69649
15.10F 29996
15.5D 48211
15.6F 79386
35.10D 59684
35.10F 16362
35.5D 34356
35.6F 156772
36.10D 143031
36.5D 179636
36.5F 279958
36.8F 170138
38.10D 48443
38.10F 5374
38.5D 165985
38.6F 258187
39.10D 718781
39.5D 32462
39.6F 124186
39.7F 26051

One of my biggest hurdles is group 39.10D, with ~720k reads, though this drops to ~112k unique sequences after the unique.seqs command.

So I decided to try a workaround where I split the data by group (this generates a .fasta and a .count_table for each group) and run pre.cluster on each group individually; my plan is to merge the output files once I am done (a rough sketch is below). For the most part this has been working, but I only have access to 30k hours of computational time per month and have already exhausted it without getting through all of the groups. Since I would prefer not to wait until my monthly allocation renews, I started looking for ways to cluster the individual data prior to running pre.cluster, thinking that this might save me some time and computational power.
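For reference, a minimal sketch of the split-and-merge approach (file names here are placeholders, diffs=2 is only an example value, and merge.count is only available in newer mothur releases):

split.groups(fasta=its.trim.unique.fasta, count=its.trim.count_table)
# run pre.cluster on each group's files in turn (39.10D shown as an example)
pre.cluster(fasta=its.trim.unique.39.10D.fasta, count=its.trim.39.10D.count_table, diffs=2, processors=28)
# once every group has finished, combine the per-group outputs again
merge.files(input=its.trim.unique.39.10D.precluster.fasta-its.trim.unique.12.10D.precluster.fasta, output=its.precluster.fasta)
merge.count(count=its.trim.unique.39.10D.precluster.count_table-its.trim.unique.12.10D.precluster.count_table, output=its.precluster.count_table)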
I found a program called MAFFT (Multiple alignment program for amino acid or nucleotide sequences).

https://mafft.cbrc.jp/alignment/server/clustering.html

This program has a tool which can 'roughly' cluster unaligned sequences. I used it to 'rough'-cluster my largest file, 39.10D, and then used the output (which retained all of the reads) in the pre.cluster command. Prior to the 'rough' clustering, my summary looked like this:
             Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     1      28       28       0       2        1
2.5%-tile:   1      271      271      0       4        17965
25%-tile:    1      271      271      0       5        179642
Median:      1      271      271      0       5        359283
75%-tile:    1      271      271      0       5        538924
97.5%-tile:  1      272      272      0       5        700600
Maximum:     1      273      273      0       8        718564
Mean:        1      270.942  270.942  0       4.96246

# of unique seqs: 111481

total # of seqs: 718564

After running the 'rough' clustering, my summary looked like this:
             Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     1        2033     28       0       2        1
2.5%-tile:   63       2481     270      0       4        2788
25%-tile:    64       2481     271      0       4        27871
Median:      64       2481     271      0       5        55741
75%-tile:    64       2481     271      0       5        83611
97.5%-tile:  313      2481     272      0       7        108694
Maximum:     2345     2510     273      0       8        111481
Mean:        79.6353  2480.99  270.714  0       4.85969

# of Seqs: 111481

Using the 'rough'-clustered .fasta and a newly generated .count_table (derived with list.seqs and get.seqs; sketch below), I was able to process the sample in 1780 seconds, whereas with the non-'rough'-clustered input I ran out of time at 150 hours of wall time and 4200 hours of CPU time. A huge difference.

Processing group 39.10D:
111481 59883 51598
Total number of sequences before pre.cluster was 111481.
pre.cluster removed 51598 sequences.

It took 1780 secs to cluster 111481 sequences.
It took 1794 secs to run pre.cluster
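
For reference, a minimal sketch of how the new count_table was derived (this assumes the MAFFT output keeps the original sequence names; file names are placeholders):

list.seqs(fasta=39.10D.mafft.fasta)
# restrict the original count_table to the names present in the rough-clustered fasta
get.seqs(accnos=39.10D.mafft.accnos, count=its.trim.39.10D.count_table)
# then run pre.cluster on the rough-clustered fasta with the picked count_table
pre.cluster(fasta=39.10D.mafft.fasta, count=its.trim.39.10D.pick.count_table, diffs=2, processors=28)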

So, long story short: what is your opinion of this approach? Do you think it will cause issues downstream? I plan on running this one dataset through the rest of the mothur SOP as a test, but I figured I would ask the professionals what they think as well.

Thank you in advance for your time.

Patrick

Ugh, I've run into this sort of constraint from server people as well. 6 GB RAM is not enough. Have you tried meeting with them and explaining that you don't need massive wall time, you need massive RAM for a short(ish) time?

I just looked at the last big fungal project that I analyzed for someone, to give you some idea of timing. Our high-memory nodes are 32 cores with 512 GB RAM, so that's what I used. The project started with 12M seqs across 215 samples (22 samples had >100k seqs). Major time steps:
make.contigs: ~3.5 hrs
pre.cluster (9.7M seqs went in): ~54 hrs
chimera.vsearch checking of 1.3M "uniques": ~16 hrs
classifying: ~4 hrs
clustering: ~3 hrs

I probably should have randomly subsampled the samples with >100k seqs but I haven’t bothered doing that to date.
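
If you wanted to do that, mothur's sub.sample command should handle it. A minimal sketch, with placeholder file names and an arbitrary size cutoff (check how your mothur version treats groups smaller than the cutoff):

sub.sample(fasta=final.fasta, count=final.count_table, persample=true, size=100000)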

