Large datasets: out of memory

I have been trying to cluster a large dataset of 30075 sequences, without any preclustering.

My commands are as follows:

summary.seqs(fasta=sequence.fasta)

align.seqs(fasta=sequence.fasta, reference=../silva.v4.fasta)

dist.seqs(fasta=sequence.align, output=lt, calc=eachgap)

cluster(phylip=sequence.phylip.dist)

It fails with an out-of-memory error and the following message:

[ERROR]: std::bad_alloc has occurred in the ClusterClassic class function getSmallCell. This error indicates your computer is running out of memory.  This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are running our 32bit version, your memory usage is limited to 4G.  If you have more than 4G of RAM and are running a 64bit OS, using our 64bit version may resolve your issue.  If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, http://www.mothur.org/wiki/Schloss_SOP. If you are uable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.

I have 8 GB of RAM with an Intel i5. If I increase the memory to 16 GB or 24 GB, will it help?
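As a rough back-of-the-envelope estimate (this assumes the clustering has to hold every pairwise distance in RAM at once, at 8 bytes per value, which may not match mothur's internals): 30075 sequences give 30075 × 30074 / 2 ≈ 4.5 × 10^8 distances, which is already roughly 3.6 GB before any other overhead.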

It’s hard to say whether it will fail or not. You could try setting cutoff=0.20 in cluster.
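For example (the file name here just carries over from the commands above):

cluster(phylip=sequence.phylip.dist, cutoff=0.20)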

I’m not sure what you’re doing upstream of these steps, but I think you’re making your life really hard. Things like pre.cluster, filter.seqs, screen.seqs, unique.seqs, etc. are all designed to reduce the number of unique sequences that have to be clustered, which keeps the process RAM-friendly and speeds it up.
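Just as a sketch of what that could look like with only a fasta file (the intermediate file names follow mothur's default output naming, so check your logfile for the exact names, and the screening criteria here are only placeholders):

screen.seqs(fasta=sequence.align, maxhomop=8)
filter.seqs(fasta=sequence.good.align, vertical=T, trump=.)
unique.seqs(fasta=sequence.good.filter.fasta)
pre.cluster(fasta=sequence.good.filter.unique.fasta, name=sequence.good.filter.names, diffs=2)

dist.seqs and cluster would then run on the much smaller pre.cluster output, passing the names file to cluster so the original sequence counts are preserved.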

Pat