Startup disk errors when creating OTU matrix???

We are hoping somebody might have some suggestions about a problem we keep encountering while running Mothur. Originally we thought it was a computer-related problem; however, the computer technician assures me there is nothing wrong with the computer itself or the way it is set up. We have had the error come up at a few steps, but most recently it occurs when we are generating an OTU matrix (e.g. cluster(column=……, cutoff=0.20, method=average)). The error is: "Your Mac OS X startup disk has no more space available for application memory, please force quit programs." At this point Mothur is the only program we have running & there is 2.43 TB of space available!? Does anybody have any suggestions? Others in our lab have used Mothur from start to finish, so at one stage it was working.

We downloaded the newest version of Mothur (on 20 Feb 2013) and are running it on Mac OS X 10.6.8, 2 × 2.93 GHz 6-Core Intel Xeon, 32 GB / 333 MHz memory & a Mac HD with 2.43 TB of space. Any help would be greatly appreciated!

Are you possibly running out of RAM?

Thank you for the reply.
I thought it might be a RAM issue as well, but the computer technician used TeamViewer while I was running the cluster command and said I had heaps of RAM (32 GB). I'm not super computer literate, so I took their word for it. I will try it again and see if I can tell whether it's losing RAM. Any other suggestions??

How big is the distance matrix you are trying to cluster?

After trying again, I think that you are correct. We have a total of 32 GB of RAM. When the error message comes up (after running for a few hours), free memory drops down to 58 MB. As soon as I force-quit Mothur it jumps back up to 29 GB. I'm processing 30 water samples. The final.dist file is pretty big (223 GB). I'm new at running the program, so I'm not sure what the typical file size would/should be. If that seems too big, maybe I've stuffed something up in the previous commands?

What type of sequencing data do you have? A 223 GB distance matrix is much bigger than anything we’ve ever seen (even when pooling dozens of plates). Are you following the SOP analysis example?

Sorry, I was looking at the wrong file folder yesterday when I gave you the 223 GB. The final.dist is only 90 GB, but that may still be too big. A previous person in our lab made a step-by-step Mothur procedure, which is what I'm following, though I know she has been through your SOP many times. It is Roche 454 data (using V4 primers).

It would be helpful to know exactly what that pipeline is. I suspect you guys are doing things to boost sequence numbers at the expense of your error rate, which has the side effect of creating 90 GB distance matrices.

I’ve started over from scratch & listed what I’ve done thus far (including some file sizes).
We start with fasta/qual files:
trim.seqs(fasta=XXXX.fna, oligos=XXXX.oligos, qfile=XXXX.qual, maxambig=0, maxhomop=8, qaverage=35, minlength=200, maxlength=500)
make.group(fasta=XXX1.trim.fasta-XXX2.trim.fasta-XXX3.trim.fasta, groups=XXX1-XXX2-XXX3)
merge.files(input=XXX1.trim.fasta-XXX2.trim.fasta-XXX3.fasta-….etc, output=XXXcombined.fasta) Fasta file at this point=197MB
unique.seqs(fasta=XXXX.fasta)
align.seqs(candidate=XXXXX.unique.fasta, template=silva.bacteria.fasta, flip=T, processors=2) Align file is 9GB
We removed the bad seqs from our alignment
remove.seqs(accnos=XXX.unique.flip.accnos, fasta=XXX.fasta)
remove.seqs(accnos=XXX.unique.flip.accnos, group=XXX.groups)
remove.seqs(accnos=XXX.unique.flip.accnos, name=XXXX.names)
summary.seqs(fasta=XXXX.unique.align, name=XXXX.pick.names)

        Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1044    1074    14      0       2       1
2.5%-tile:      15647   34121   479     0       4       9751
25%-tile:       15647   34124   486     0       5       97501
Median:         15647   34124   488     0       5       195002
75%-tile:       15647   34124   489     0       5       292503
97.5%-tile:     15647   34124   496     0       6       380253
Maximum:        16360   35131   500     0       8       390003
Mean:           15646.2 34111.2 487.568 0       5.059
# of unique seqs: 211191
total # of seqs: 390003

screen.seqs(fasta=XXXX.unique.align, name=XXXX.pick.names, group=XXXX.pick.groups, end=XXXX, optimize=start, criteria=95, processors=2)
filter.seqs(fasta=XXXX.unique.good.align, vertical=T, trump=., processors=2) Fasta file at this point=245MB
unique.seqs(fasta=XXXX.pick.unique.good.filter.fasta, name=XXXX.pick.good.names) Fasta file=244MB, name=9MB
pre.cluster(fasta=XXXX.pick.unique.good.filter.unique.fasta, name=XXXX.pick.unique.good.filter.names, processors=2)

The pre-clustering takes quite a while & is still running. I'll do the chimera.slayer step next, which, in the past, has taken a few days.
chimera.slayer(fasta=XXXX.unique.good.filter.unique.precluster.fasta, name=XXXX.unique.good.filter.unique.precluster.names, group=XXXX.good.groups, reference=self, processors=12)

This comes up a lot, so let me just mention something first. Data quality is important for two reasons. First, because if you have crap data, you’ll have crap results. Second, the analysis that is done uses unique sequences. So every sequence with an error in it is likely to be a new unique sequence. This is how people get ginormous distance matrices.
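
Just to put some rough numbers on that (a back-of-the-envelope sketch using the figures from the summary you posted above): with roughly 211,000 unique sequences, an all-against-all distance calculation has on the order of 211,000 × 211,000 / 2 ≈ 22 billion pairs to consider. Even though only the pairs below the cutoff get written to the column file, at a few dozen bytes per line that adds up to tens of GB very quickly, which is why cutting down the number of erroneous unique sequences makes such a difference to the size of the matrix.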

Towards this goal, there are two (three) things to do differently…

  1. In trim.seqs, qaverage=35 does very little to improve the quality of the data. If you look at Fig. 1B of our PLoS ONE paper, you'll see the results of using qaverage=35 in the 7th column. Interestingly, of all the options (except the hard cutoff at q25), it removes the most sequences, and the average error rate is close to 0.4%. In contrast, if you were to use qwindowaverage=35 with qwindowsize (see the SOP for how to do this), you'd get back more total sequences and the average error rate would drop to 0.08% (see the 11th column). The error rate drops even further after the pre.cluster step.

  2. It would really be a lot better to get the original sff files and use the trim.flows/shhh.flows approach. There, the error rates are lower, you get more data back, and the reads are actually longer (250 bp vs. 200 bp).

  3. In pre.cluster, give the command your group file and it will do the pre.clustering by group. This will make it go a lot faster and probably be more accurate. See the sketch below this list for what all three of these changes might look like.
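
To make those concrete, here is a rough sketch of what the commands might look like. The XXXX names are placeholders for your own files, and the specific values (e.g. qwindowsize=50 and the pdiffs/bdiffs settings) are just the ones the SOP has used, so please check them against the current SOP page before running anything.

For point 1, your existing trim.seqs line would change to something like:

trim.seqs(fasta=XXXX.fna, oligos=XXXX.oligos, qfile=XXXX.qual, maxambig=0, maxhomop=8, qwindowaverage=35, qwindowsize=50, minlength=200, maxlength=500)

For point 2, if you can get hold of the original sff files, the flowgram route looks roughly like this (and replaces the qual-based trimming above):

sff.info(sff=XXXX.sff, flow=T)
trim.flows(flow=XXXX.flow, oligos=XXXX.oligos, pdiffs=2, bdiffs=1, processors=2)
shhh.flows(file=XXXX.flow.files, processors=2)
trim.seqs(fasta=XXXX.shhh.fasta, name=XXXX.shhh.names, oligos=XXXX.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T, processors=2)

And for point 3, you only need to add your group file (whatever it is called at that stage) to the pre.cluster call you already have:

pre.cluster(fasta=XXXX.pick.unique.good.filter.unique.fasta, name=XXXX.pick.unique.good.filter.names, group=XXXX.good.groups, processors=2)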

Finally, I know a lot of groups have their own batchfiles that are based (loosely or directly) on the SOP. I would encourage people to look at it periodically and see how it has changed with successive releases. This can be done easily by comparing versions in the history tab.

Hope this helps,
Pat

Thanks Pat, that helps a lot. I’ll make the suggested changes & hopefully have better luck.
Cheers!