shhh.flows memory problem

I am using mothur on a cluster computer and I have problems with shhh.flows and memory.

If I set processors=1, the command doesn't finish within one week, which is the maximum walltime. If I run with processors=2, it runs out of memory and shhh.flows does not complete:
[ERROR]: std::bad_alloc has occurred in the ShhherCommand class function driver. This error indicates your computer is running out of memory. …

My dataset has 590 000 sequences and 17 samples (the largest sample has 78 000 sequences).

Here are the commands:

mothur "#trim.flows(flow=water.flow, oligos=water.oligos, pdiffs=2, bdiffs=1, minflows=360, maxflows=720, order=B, processors=2)"
mothur "#shhh.flows(file=water.flow.files, order=B, processors=2)"

And the information about the cluster computer:
#SBATCH -N 1
#SBATCH --tasks-per-node=16
#SBATCH --exclusive

CPU: 2 x AMD 6220 (3.0 GHz, 8-core)
Memory: 32-64 GB (2-4 GB/core)

If I run trim.flows with maxflows=450 and 4 processors, then shhh.flows does not crash.
I ran the same commands on a similar dataset and it was fine, even with 4 processors (that dataset had 650 000 sequences, 28 samples, and 52 000 sequences in the largest sample).

Is there a way to change the commands, or the resource request on the cluster, so that I don't run out of memory?
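
I was wondering whether requesting one of the 64 GB nodes explicitly would help. Something like this is what I had in mind (an untested sketch; I'm not sure whether the --mem option is enabled on our SLURM installation):

#SBATCH -N 1
#SBATCH --tasks-per-node=16
#SBATCH --exclusive
# ask for a 64 GB node instead of letting the scheduler pick a 32 GB one
#SBATCH --mem=64G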

In trim.flows you need to set minflows to the same value as maxflows, or you will have this type of problem. With order=B, you can safely go up to minflows=1000, maxflows=1000, assuming that your reads are that long.
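
For example, keeping the rest of your trim.flows options the same:

mothur "#trim.flows(flow=water.flow, oligos=water.oligos, pdiffs=2, bdiffs=1, minflows=1000, maxflows=1000, order=B, processors=2)"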

Pat

Thank you very much for the answer!

I don't really understand how to pick the correct flow values for the denoising step. I tested different values, and they change the results quite a lot (the OTU table looks quite different with different parameters). I'm working with the V1-V3 region.

The wiki says that the V1-V3 region only requires 350-400 flows, which doesn't really work for my dataset: if I use minflows=360, maxflows=360, almost all my sequences get discarded.
I tested maxflows=450, and it gave much shorter sequences, with an average length below 200 bp. For this reason I chose minflows=360, maxflows=720, as suggested by Chris Quince for the V1-V3 region in the wiki, but then I ran into the problems described above.

Is minflows=1000, maxflows=1000 still going to reduce the error rate for the V1-V3 region?
Which minflows and maxflows values would you recommend for the denoising of the V1-V3 region?


If you look at "Reducing the Effects of PCR Amplification and Sequencing Artifacts on 16S rRNA-Based Studies", you will see that if you do 360/720 you won't get any denoising. If your data were really generated using flow order B and you have the V1-V3 region, then I would use minflows=1000, maxflows=1000. It's also possible that your data are crummy and you didn't get good read lengths. Regardless, the minflows and maxflows values need to be the same.

Also, what do you get when you run summary.seqs on the fasta file generated by sffinfo?

Pat

Thank you so much for the answer. That is very helpful for me.

The data were generated this year with a FLX++, so I guess it should be order=B.

Here is the output of the summary.seqs command of the dataset that failed:

mothur > summary.seqs(fasta=water.fasta)

Using 1 processors.

            Start  End      NBases   Ambigs     Polymer  NumSeqs
Minimum:    1      62       62       0          3        1
2.5%-tile:  1      446      446      0          4        14779
25%-tile:   1      488      488      0          4        147781
Median:     1      511      511      0          5        295562
75%-tile:   1      540      540      0          5        443342
97.5%-tile: 1      555      555      0          7        576344
Maximum:    1      1646     1646     8          32       591122
Mean:       1      511.861  511.861  0.0243215  4.80473

# of Seqs: 591122

Output File Names:
water.summary

It took 12 secs to summarize 591122 sequences.

I tried minflows=1000 and maxflows=1000, and I lose so many sequences: only 25782 of the 591122 remain after trim.seqs.

Here are the commands I ran:

mothur > sffinfo(sff=water.sff, flow=T)
mothur > summary.seqs(fasta=water.fasta)
mothur > trim.flows(flow=water.flow, oligos=water.oligos, pdiffs=2, bdiffs=1, minflows=1000, maxflows=1000, order=B, processors=2)
mothur > shhh.flows(file=water.flow.files, order=B, processors=2)
mothur > trim.seqs(fasta=water.shhh.fasta, name=water.shhh.names, oligos=water.oligos, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2, processors=8, minlength=200)
mothur > summary.seqs(fasta=current, name=current)

And here is the output of summary.seqs after trim.seqs:

mothur > summary.seqs(fasta=current, name=current)
Using water.shhh.trim.fasta as input file for the fasta parameter.
Using water.shhh.trim.names as input file for the name parameter.

Using 8 processors.

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      432      432      0       3        1
2.5%-tile:  1      489      489      0       4        645
25%-tile:   1      489      489      0       5        6446
Median:     1      496      496      0       5        12892
75%-tile:   1      514      514      0       5        19337
97.5%-tile: 1      534      534      0       6        25138
Maximum:    1      614      614      0       8        25782
Mean:       1      501.864  501.864  0       4.99732

# of unique seqs: 5695

total # of seqs: 25782

Does that mean that my sequences are not good, or should I choose different values for minflows and maxflows?

Try 800 or 900 flows, but I suspect the data are bad.
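
That is, something along these lines, with the two values matched as before:

mothur "#trim.flows(flow=water.flow, oligos=water.oligos, pdiffs=2, bdiffs=1, minflows=800, maxflows=800, order=B, processors=2)"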

Pat