Using mothur v.1.48.0, I’m running out of memory with make.contigs, and I did not expect it to use much memory. I have 612 samples with roughly 300K reads each. It seemed to fail at the end, where I assume the count_table was being created (I could see “Processing file pair” for all samples in my stdout). This failed with 128 GB; I tried 4 to 16 processors and got an ‘out of memory’ error on every attempt.
I ended up splitting the data into 6 subsets (6 being arbitrary), making contigs, and then merging the files together. However, I had issues merging the 6 count files, and again used a split-combine approach, where I merged count_tables 1-2-3 and 4-5-6, and then merged those final 2 together. This ended up working and appeared to use about 90 GB of memory.
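For reference, the run looked roughly like this as a mothur batch file (the file names are placeholders, the merge.count syntax is from memory, and lines starting with # are comments; check the wiki for your version’s exact options):

```
# one make.contigs run per subset (repeated for subsets 1-6)
make.contigs(file=subset1.files, processors=8)

# merge the count tables in two stages to keep peak memory down
merge.count(count=subset1.contigs.count_table-subset2.contigs.count_table-subset3.contigs.count_table, output=merged_123.count_table)
merge.count(count=subset4.contigs.count_table-subset5.contigs.count_table-subset6.contigs.count_table, output=merged_456.count_table)
merge.count(count=merged_123.count_table-merged_456.count_table, output=final.count_table)

# concatenate the contig fastas and check the per-sample totals
merge.files(input=subset1.trim.contigs.fasta-subset2.trim.contigs.fasta-subset3.trim.contigs.fasta-subset4.trim.contigs.fasta-subset5.trim.contigs.fasta-subset6.trim.contigs.fasta, output=final.fasta)
count.groups(count=final.count_table)
```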
I guess my questions are: why would it fail with the full data, and why does the split-combine approach work? I did not think this was a memory-intensive step. Indeed, summary.seqs is now failing on the output of the successful split-and-combine job (make.contigs x6, merge x2, merge x1, count.groups).
The full dataset is likely too large for the memory allocated.
> and why does the split-combine approach work? I did not think this was a memory-intensive step? Indeed, summary.seqs is now failing
The split approach uses less memory, but if summary.seqs is failing then the final merge might not have had enough memory to complete. An incomplete or corrupted count file would likely cause an out-of-memory or segfault error. Alternatively, if the merge did complete, you may not have enough memory to hold both the fasta file and the count file in memory.
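A quick sanity check before rerunning summary.seqs (reusing the placeholder names from the sketch above) is to run something lightweight against the merged table:

```
# if this completes cleanly and the grand total matches the number of
# sequences in the merged fasta, the merged table is probably intact
count.groups(count=final.count_table)
```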
My naive understanding of count table generation was that it is “append-like”, adding new data to a previous file, i.e., make contigs for a sample and then add the sequence names and counts to the count file sequentially (adding columns and rows). That matched how I can see the count table grow in size with other commands before they complete, hence I assumed it wasn’t memory intensive. I do not know the inner workings of the functions and have only made assumptions; they must be incorrect.
I find that with bigger datasets there is a balance between processors and allocated memory across the different steps of the pipeline. Are such things discussed in the MiSeq SOP? Ballpark estimates of CPU vs. memory requirements for each step would be useful, e.g., “this step is memory intensive, so opt for high memory per CPU” or “this step is not memory intensive, so opt for more processors”, rather than actual specific values.
I’m afraid that a bigger problem for you is that you are generating a ton of contigs that are 300 nt long (and 300K reads per sample is about 20-fold more than typical). This means that you either don’t have fully overlapping reads or that you’re using the 2x300 chemistry. Both will have a very high error rate, making most things appear as unique sequences. This is why we strongly recommend using the 2x250 chemistry to get fully overlapping reads that cover the V4 region. As for processes like creating the count file: memory is required because we’re adding columns to the data, not merely appending rows.
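To illustrate with a toy example (not your data): in the uncompressed layout, the count_table carries one column per group, so each new sample widens every row instead of just appending rows at the bottom:

```
Representative_Sequence total   sampleA sampleB sampleC
seq_001                 12      5       7       0
seq_002                 3       0       0       3
seq_003                 9       9       0       0
```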
You can see more here…
Illumina’s quality hasn’t improved in the 11 years since I wrote that post.
I’ve been out of the game for almost 10 years and am helping on this project. Unfortunately, the design was out of my hands. I do remember this post of yours; a shame that the technology has not improved.
Indeed, the number of unique sequences at the end of the processing pipeline is far too many! We have needed to use split.abund to reduce the number of uniques.
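For reference, we have been removing the rare sequences with something along these lines (the cutoff of 1 is our choice, not a recommendation):

```
# split sequences into rare (total count <= 1) and abundant sets,
# then continue the pipeline with the *.abund.* output files
split.abund(fasta=final.fasta, count=final.count_table, cutoff=1)
```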
Maybe a better option would be to randomly grab something like 20,000 reads from each sample and then pick up from there. You will still have the problem of non-fully overlapping reads, but it should let you get a little further along.
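Within mothur, one way to do this after make.contigs would be something like the following, assuming your per-sample counts live in the count file (placeholder names again):

```
# randomly draw 20000 reads from each sample; persample=true samples
# each group separately rather than pooling all reads together
sub.sample(fasta=final.fasta, count=final.count_table, size=20000, persample=true)
```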
We’ll also keep thinking of ways to make this part of the pipeline more memory efficient.