Too much data produced when running dist.seqs

Hi,
I’m using mothur to analyze 200 samples from an Illumina GAIIx run, about 200 GB of raw 16S FASTQ data.

Everything was going smoothly until I ran dist.seqs, which produced a total of 8 TB of data. Is something wrong?

Are you following the steps in the SOP, http://www.mothur.org/wiki/MiSeq_SOP?

Yes.

Can you post the commands you have run so far?

Hi westcott,
Thanks for your help!
I analyzed 2x150 bp raw data from 7 lanes of a GAIIx (40 GB of *.fastq.gz), targeting the V3 region (341f-518r).
The problem may be the number of unique sequences: there were still 4 million uniques after remove.lineage finished.


# assemble the paired-end reads into contigs
make.contigs(file=stability.files, processors=8)
summary.seqs(fasta=stability.trim.contigs.fasta)
system(gzip stability.scrap.contigs.fasta)
# drop contigs with ambiguous bases or unexpected lengths
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, summary=stability.trim.contigs.summary, maxambig=0, maxlength=250)
summary.seqs(fasta=stability.trim.contigs.good.fasta)
system(gzip stability.trim.contigs.fasta)
# dereplicate and build the count table
unique.seqs(fasta=stability.trim.contigs.good.fasta)
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
summary.seqs(count=stability.trim.contigs.good.count_table)
system(gzip stability.trim.contigs.good.fasta)
# align to a V3-trimmed SILVA reference
align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v3.fasta)
summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
system(gzip stability.trim.contigs.good.unique.fasta)
# keep sequences that align to the expected coordinates
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=888, end=8362, maxhomop=9)
summary.seqs(fasta=current, count=current)
system(gzip stability.trim.contigs.good.unique.align)
# strip empty alignment columns and overhangs, then re-dereplicate
filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
system(gzip stability.trim.contigs.good.unique.good.align)
unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
system(gzip stability.trim.contigs.good.unique.good.filter.fasta)
# merge sequences within 2 mismatches of a more abundant one
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
system(gzip stability.trim.contigs.good.unique.good.filter.unique.fasta)
# detect and remove chimeras
chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)
system(gzip stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta)
summary.seqs(fasta=current, count=current)
# classify and remove non-target lineages
classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

4 million uniques will create a distance matrix that is too large to cluster. Here are some cluster memory stats: http://www.mothur.org/wiki/Cluster_stats. I think you are limited to a phylotype-based analysis.
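For scale: a full pairwise matrix on N unique sequences holds N(N-1)/2 distances, so with N = 4 million that is roughly 8 x 10^12 pairs; even a cutoff-filtered, column-format distance file runs to terabytes, which matches what you're seeing. The phylotype approach works straight from the taxonomy you already have from classify.seqs and remove.lineage; something along these lines (double-check the exact file names mothur reported after remove.lineage):

phylotype(taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy)
make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, label=1)
classify.otu(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, label=1)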

To follow up on Sarah’s comment, I think the problem has everything to do with using GAII and short reads that do not fully overlap. If you look at our Kozich et al. paper in AEM, we show that the sequences have to fully overlap to reduce the error rate. Like she indicated, doing an OTU-based analysis may be computationally out of reach.
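As a rough calculation (taking the 341f-518r amplicon to be about 180 bp): two 150 bp reads overlap by about 150 + 150 - 180 = 120 bp in the middle, which leaves roughly 30 bp at each end covered by only one read. Those single-read stretches retain the raw error rate, and the extra errors inflate the number of unique sequences, which is why the downstream denoising steps can't collapse them further.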

Thanks, westcott.
Is there any other way to reduce the number of unique sequences?
Here are the counts after each step:
make.contigs: 160,000,000 sequences (stability.trim.contigs.fasta)
screen.seqs: 130,000,000 sequences (stability.trim.contigs.good.summary)
unique.seqs: 17,382,705 sequences (stability.trim.contigs.good.unique.fasta)
align.seqs + screen.seqs: 13,945,444 sequences (stability.trim.contigs.good.unique.good.align)
pre.cluster (diffs=2): 7,348,122 sequences (stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta)
chimera.uchime + remove.seqs + classify.seqs + remove.lineage: 3,902,627 sequences (stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta)
That's still far too many!
And when I check stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.summary, I see that a large number of unique sequences have a numSeqs of 1. Can I delete them?
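Could something like split.abund pull those singletons out before the distance step? Just a guess at the syntax and file names; I haven't tried it:

split.abund(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, cutoff=1)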

Thanks, pschloss.
I had read your paper (AEM, 2013, MiSeq data) before; it's a nice paper, and I've been using your pipeline in my lab. :smiley:
I used 2x150 bp reads for the V3 region, with at least 80-120 bp of overlap.