cluster memory error

Good day.
I’m running an analysis of 16S amplicons obtained from an Illumina run as in Caporaso et al., 2010 PNAS.
I’m using Mothur v.1.16.0 on a server and v.1.20.1 (Win 64) on a desktop PC running Windows 7.
My FASTA file has 270 groups with 10,000 sequences in each. Each sequence is 102 bp long. Quality is >40 according to FASTX.
The pipeline is as follows:
summary.seqs(fasta=stool.trim.fasta)
unique.seqs(fasta=stool.trim.fasta)
summary.seqs(fasta=stool.trim.unique.fasta)
align.seqs(candidate=stool.trim.unique.fasta, template=silva.bacteria.fasta, flip=T, processors=2)
summary.seqs(fasta=stool.trim.unique.align)
screen.seqs(fasta=stool.trim.unique.align, name=stool.trim.names, group=stool.groups, end=16312, optimize=start, criteria=85, processors=2)
summary.seqs(fasta=stool.trim.unique.good.align)
filter.seqs(fasta=stool.trim.unique.good.align, vertical=T, trump=., processors=2)
summary.seqs(fasta=stool.trim.unique.good.filter.fasta)
unique.seqs(fasta=stool.trim.unique.good.filter.fasta, name=stool.trim.good.names)
summary.seqs(fasta=stool.trim.unique.good.filter.unique.fasta)
pre.cluster(fasta=stool.trim.unique.good.filter.unique.fasta, name=stool.trim.unique.good.filter.names, diffs=1)
summary.seqs(fasta=stool.trim.unique.good.filter.unique.precluster.fasta)
filter.seqs(fasta=silva.gold.align, hard=stool.filter)
chimera.slayer(fasta=stool.trim.unique.good.filter.unique.precluster.fasta, template=silva.gold.filter.fasta, minsnp=100, processors=2)
remove.seqs(accnos=stool.trim.unique.good.filter.unique.precluster.slayer.accnos, fasta=stool.trim.unique.good.filter.unique.precluster.fasta, name=stool.trim.unique.good.filter.unique.precluster.names, group=stool.good.groups, dups=T)
dist.seqs(fasta=stool.trim.unique.good.filter.unique.precluster.pick.fasta, output=lt, processors=2)
system(copy stool.good.pick.groups stool.final.groups)
system(copy stool.trim.unique.good.filter.unique.precluster.pick.phylip.dist stool.final.dist)
system(copy stool.trim.unique.good.filter.unique.precluster.pick.names stool.final.names)
system(copy stool.trim.unique.good.filter.unique.precluster.pick.fasta stool.final.fasta)
cluster(phylip=stool.final.dist)

Here the program reports a memory error.
mothur > cluster(phylip=stool.final.dist)
********************###########
Reading matrix: |[ERROR]: St9bad_alloc has occurred in the ReadPhylipMatrix class function read.

This error occurs on both the server and the desktop runs. The phylip matrix contains 128246 sequences and is 50 GB in size.
Does this mean the matrix is too big? I also tried a matrix 10 times smaller and got a similar memory error.
Please advise how I can change the analysis so that the pipeline keeps running and I can go on to
make.shared(list=stool.final.an.list, group=stool.final.groups, label=0.03)
summary.single(shared=stool.final.an.shared, calc=nseqs-coverage-simpson-sobs-invsimpson-chao)
Thank you in advance!
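
For a sense of scale: a lower-triangle matrix over n sequences holds n(n-1)/2 distances, which for the 128246 sequences mentioned above is roughly 8.2 billion values. The Python sketch below works that out. The 4 bytes per stored distance is an assumption made only for illustration, not a statement about how mothur holds the matrix internally, and the function names are hypothetical.

```python
def pairwise_count(n_seqs: int) -> int:
    """Number of distances in a lower-triangle matrix (diagonal excluded)."""
    return n_seqs * (n_seqs - 1) // 2


def estimated_gib(n_seqs: int, bytes_per_distance: int = 4) -> float:
    """Rough memory needed to hold every distance at once, in GiB.

    bytes_per_distance is an assumed value, not taken from mothur.
    """
    return pairwise_count(n_seqs) * bytes_per_distance / 2**30


if __name__ == "__main__":
    n = 128246  # sequences in stool.final.dist, as reported in the post
    print(f"{pairwise_count(n):,} distances")
    print(f"about {estimated_gib(n):.1f} GiB at 4 bytes per distance")
```

Because the count grows with the square of the number of unique sequences, halving the number of uniques cuts the matrix to about a quarter of its size.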

It’s memory, and given that it’s Illumina-generated data with no data curation, it’s because the sequencer is generating diversity: sequencing errors show up as extra unique sequences, and that inflates the distance matrix. You need to do some type of quality trimming, like the second approach described in the SOP (qwindowaverage=35, qwindowsize=50). Prepare for your reads to get much, much shorter.
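
To illustrate the idea behind window-based quality trimming, here is a minimal Python sketch. It is not mothur's implementation: the function name and the exact cut rule are assumptions, and only the two values (window of 50, minimum average of 35) come from the SOP settings quoted above. A window slides along the quality scores one base at a time, and the read is cut just before the base that first pulls the window average below the threshold.

```python
def keep_length(quals, window_size=50, min_average=35.0):
    """Return how many leading bases to keep after window trimming.

    Illustrative rule (an assumption, not mothur's code):
    - if the very first window already fails, keep nothing;
    - otherwise cut just before the last base of the first failing window.
    """
    n = len(quals)
    if n == 0:
        return 0
    w = min(window_size, n)          # short reads are treated as one window
    window_sum = sum(quals[:w])
    start = 0
    while True:
        if window_sum / w < min_average:
            return 0 if start == 0 else start + w - 1
        if start + w >= n:           # last window passed: keep the whole read
            return n
        # slide the window one base to the right
        window_sum += quals[start + w] - quals[start]
        start += 1


if __name__ == "__main__":
    # Made-up 102-base read: 60 good bases followed by 42 poor ones.
    read_quals = [40] * 60 + [2] * 42
    print(keep_length(read_quals))
```

With a made-up read like the one in the example, a large part of the 102 bases is dropped, which is what "much, much shorter" looks like in practice.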

Thank you, Pat! I will try the SOP filtering.
I’m also trying:
cluster(phylip=stool.final.dist, cutoff=0.03)
With this option it is running so far. read.dist(phylip=stool.final.dist) has finished on the server, and I’m now trying to cluster with cluster().
Do you think cutoff=0.03 is reasonable?
Thanks!

This is about cutoff=0.03.
cluster(****, cutoff=0.03) just finished running.
It was a definite improvement over just cluster(phylip=). I examined the shared report and found a coverage of 0.1-0.3.
That is too low considering 10,000 sequences per sample and the very simple microbiota in each sample (454 was giving me a consistent coverage of 0.95 with <1,000 reads). It is a strange result. Any ideas?
Regarding quality trimming:
I don’t think it will work. My sequence quality supposedly >40 for each base. So a moving window won’t improve it, I guess.
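
A note on what a coverage of 0.1-0.3 implies, assuming the coverage calculator reports Good's coverage, C = 1 - n1/N, where n1 is the number of OTUs containing a single sequence and N is the total number of sequences in the sample. The Python sketch below uses made-up OTU sizes purely for illustration, and the function name is hypothetical.

```python
def goods_coverage(otu_sizes):
    """Good's coverage: 1 - (number of singleton OTUs / total sequences)."""
    total = sum(otu_sizes)
    if total == 0:
        raise ValueError("sample contains no sequences")
    singletons = sum(1 for size in otu_sizes if size == 1)
    return 1 - singletons / total


if __name__ == "__main__":
    # Made-up sample of 10,000 reads: one OTU of 2,000 reads
    # plus 8,000 OTUs that each contain a single read.
    noisy_sample = [2000] + [1] * 8000
    print(round(goods_coverage(noisy_sample), 3))
```

Under that definition, a coverage of 0.1-0.3 with 10,000 reads means that roughly 7,000-9,000 reads per sample sit in singleton OTUs, which is the pattern expected when sequencing errors make most reads look unique.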

My sequence quality supposedly >40 for each base.

Sorry, but for the dataset you’re speaking of, the quality scores are definitely not >40 for every base. The problem is the quality of the data.