Using mothur v.1.48.0, I’m running out of memory with make.contigs, and I did not expect it to use much memory. I have 612 samples with roughly 300K reads each. It seemed to fail at the end, where I assume the count_table was being created (I could see “Processing file pair” for all samples in my stdout). This failed with 128 GB; I tried 4 to 16 processors and got an ‘out of memory’ error on every attempt.
I ended up splitting the data into 6 subsets (6 being arbitrary), making contigs, and then merging the files together. However, I had issues merging the 6 count files, and again used a split-combine approach, where I merged count_tables 1-2-3 and 4-5-6, and then merged those final 2 together. This ended up working and appeared to use about 90 GB of memory.
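For reference, the run looked roughly like this as a mothur batch file (the file names are placeholders, the merge.count syntax is from memory, and lines starting with # are comments; check the wiki for your version’s exact options):

```
# one make.contigs run per subset (repeated for subsets 1-6)
make.contigs(file=subset1.files, processors=8)

# merge the count tables in two stages to keep peak memory down
merge.count(count=subset1.contigs.count_table-subset2.contigs.count_table-subset3.contigs.count_table, output=merged_123.count_table)
merge.count(count=subset4.contigs.count_table-subset5.contigs.count_table-subset6.contigs.count_table, output=merged_456.count_table)
merge.count(count=merged_123.count_table-merged_456.count_table, output=final.count_table)

# concatenate the contig fastas and check the per-sample totals
merge.files(input=subset1.trim.contigs.fasta-subset2.trim.contigs.fasta-subset3.trim.contigs.fasta-subset4.trim.contigs.fasta-subset5.trim.contigs.fasta-subset6.trim.contigs.fasta, output=final.fasta)
count.groups(count=final.count_table)
```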
I guess my questions are: why would it fail with the full data, and why does the split-combine approach work? I did not think this was a memory-intensive step. Indeed, summary.seqs is now failing on the output of the successful split-and-combine job (make.contigs x6, merge x2, merge x1, count.groups).
The full dataset is likely too large for the memory allocated.
> and why does the split-combine approach work? I did not think this was a memory-intensive step? Indeed, summary.seqs is now failing
The split approach uses less memory, but if summary.seqs is failing then the final merge might not have had enough memory to complete. An incomplete or corrupted count file would likely cause an out-of-memory or segfault error. Alternatively, if the merge did complete, you may not have enough memory to hold both the fasta file and the count file in memory.
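A quick sanity check before rerunning summary.seqs (reusing the placeholder names from the sketch above) is to run something lightweight against the merged table:

```
# if this completes cleanly and the grand total matches the number of
# sequences in the merged fasta, the merged table is probably intact
count.groups(count=final.count_table)
```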
My naive understanding of count table generation was that it is “append-like”, adding new data to a previous file, i.e., make contigs for a sample and then add the sequence names and counts to the count file sequentially (adding columns and rows). That matched how I can see the count table grow in size with other commands before they complete, hence I assumed it wasn’t memory intensive. I do not know the inner workings of the functions and have only made assumptions; they must be incorrect.
I find that with bigger datasets there is a balance between processors and allocated memory across the different steps of the pipeline. Are such things discussed in the MiSeq SOP? Ballpark estimates of CPU vs. memory requirements for each step would be useful, e.g., “this step is memory intensive, so opt for high memory per CPU” or “this step is not memory intensive, so opt for more processors”, rather than actual specific values.
I’m afraid that a bigger problem for you is that you are generating a ton of contigs that are 300 nt long (and 300K reads per sample is about 20-fold more than typical). This means that you either don’t have fully overlapping reads or that you’re using the 2x300 chemistry. Both will have a very high error rate, making most things appear as unique sequences. This is why we strongly recommend using the 2x250 chemistry to get fully overlapping reads that cover the V4 region. As for processes like creating the count file: memory is required because we’re adding columns to the data, not merely appending rows.
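To illustrate with a toy example (not your data): in the uncompressed layout, the count_table carries one column per group, so each new sample widens every row instead of just appending rows at the bottom:

```
Representative_Sequence total   sampleA sampleB sampleC
seq_001                 12      5       7       0
seq_002                 3       0       0       3
seq_003                 9       9       0       0
```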
You can see more here…
Illumina’s quality hasn’t improved in the 11 years since I wrote that post.
I’ve been out of the game for almost 10 years and am helping on this project. Unfortunately, the design was out of my hands. I do remember this post of yours; a shame that the technology has not improved.
Indeed, the number of unique sequences at the end of the processing pipeline is far too many! We have needed to use split.abund to reduce the number of uniques.
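For reference, we have been removing the rare sequences with something along these lines (the cutoff of 1 is our choice, not a recommendation):

```
# split sequences into rare (total count <= 1) and abundant sets,
# then continue the pipeline with the *.abund.* output files
split.abund(fasta=final.fasta, count=final.count_table, cutoff=1)
```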
Maybe a better option would be to randomly grab something like 20,000 reads from each sample and then pick up from there. You will still have the problem of non-fully overlapping reads, but it should let you get a little further along.
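Within mothur, one way to do this after make.contigs would be something like the following, assuming your per-sample counts live in the count file (placeholder names again):

```
# randomly draw 20000 reads from each sample; persample=true samples
# each group separately rather than pooling all reads together
sub.sample(fasta=final.fasta, count=final.count_table, size=20000, persample=true)
```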
We’ll also keep thinking of ways to make this part of the pipeline more memory efficient.