cluster and cluster.split error

Hi there

I am a newbie so I apologize in advance for any stupid questions I may ask…
So I am working with MiSeq sequences that are 16S V4 region from guts from fungus-growing ants, I know that in those ant guts there are 3-5 abundant OTUs (which are the 95-98% of the whole community) and maybe some traces of other ones
To process my sequences I follow the MiSeq SOP tutorial and I ran the following commands:

make.contigs(file=Untitled2, processors=8)
summary.seqs(fasta=Actinotrim.contigs.fasta)
screen.seqs(fasta=Actinotrim.contigs.fasta, group=Actinocontigs.groups, maxambig=0, maxlength=330)
unique.seqs(fasta=Actinotrim.contigs.good.fasta)
count.seqs(name=Actinotrim.contigs.good.names, group=Actinocontigs.good.groups)
summary.seqs(count=Actinotrim.contigs.good.count_table, fasta=Actinotrim.contigs.good.unique.fasta)
align.seqs(fasta=Actinotrim.contigs.good.unique.fasta, reference=silva.bacteria.v4.fasta, processors=8)
summary.seqs(fasta=Actinotrim.contigs.good.unique.align, count=Actinotrim.contigs.good.count_table)
"Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 11895 25318 270 0 3 1
2.5%-tile: 11895 25319 292 0 4 374
25%-tile: 11895 25319 293 0 4 3740
Median: 11895 25319 293 0 4 7479
75%-tile: 11895 25319 293 0 5 11218
97.5%-tile: 11895 25319 294 1 6 14583
Maximum: 11897 25319 351 5 9 14956
Mean: 11895 25319 292.977 0.0573014 4.57014

of Seqs: 14956"

screen.seqs(fasta=Actinotrim.contigs.good.unique.align, count=Actinotrim.contigs.good.count_table, summary=Actinotrim.contigs.good.unique.summary, start=11895, end=25319, maxhomop=8, processors=8)
summary.seqs(fasta=current, count=current)
"Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 11895 25319 268 0 3 1
2.5%-tile: 11895 25319 292 0 4 125220
25%-tile: 11895 25319 292 0 4 1252194
Median: 11895 25319 293 0 4 2504388
75%-tile: 11895 25319 293 0 5 3756581
97.5%-tile: 11895 25319 293 0 5 4883555
Maximum: 11895 25319 305 0 8 5008774
Mean: 11895 25319 292.644 0 4.43981

of unique seqs: 703114

total # of seqs: 5008774

Output File Names:
shit/Actinotrim.contigs.good.unique.good.summary"

filter.seqs(fasta=Actinotrim.contigs.good.unique.good.align, vertical=T, trump=.)
unique.seqs(fasta=Actinotrim.contigs.good.unique.good.filter.fasta, count=Actinotrim.contigs.good.good.count_table)
pre.cluster(fasta=Actinotrim.contigs.good.unique.good.filter.unique.fasta, count=Actinotrim.contigs.good.unique.good.filter.count_table, diffs=2)
summary.seqs(fasta=current, count=current)
chimera.uchime(fasta=Actinotrim.contigs.good.unique.good.filter.unique.precluster.fasta, count=Actinotrim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
remove.seqs(fasta=Actinotrim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=Actinotrim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)
classify.seqs(fasta=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=Actinotrim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=silva.bacteria.ng.fasta, taxonomy=silva.bacteria.silva.tax, cutoff=80, processors=8)
remove.lineage(fasta=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=Actinotrim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.silva.wang.taxonomy, taxon=Chloroplast-Mitochondria-Archaea-Eukaryota)

after all of the above steps I end up with 131000 sequences and my objective is to do an OTU-based analysis
however even if I am able to run normally the dist.seqs command (it takes about 4hr to make an 80GB dist file) when I run the cluster command: cluster(column=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.pick.ed.dist, count=Actinotrim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.ed.count_table)
the computer runs out of memory…

then I m trying to run cluster.split, first using the following command
cluster.split(column=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.pick.ed.dist, count=Actinotrim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.ed.count_table, taxonomy=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.silva.wang.pick.ed.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.05)
again it runs out of memory

then I tried to run in the way it is presented in the MiSeq SOP
cluster.split(fasta=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.pick.ed.fasta, count=Actinotrim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.ed.count_table, taxonomy=Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.silva.wang.pick.ed.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.05)
again it starts normally but in the terminal it says that the method is NOT “classify” but “fasta”
after a couple of hours and while it is clustering the dist files I get again a memory error
"Clustering shit/Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.pick.filter_ed.fasta.2.dist
|


Clustering shit/Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.pick.filter_ed.fasta.1.dist


Clustering shit/Actinotrim.contigs.good.unique.good.filter.unique.precluster.pick.pick.filter_ed.fasta.0.dist
Cutoff was 0.055 changed cutoff to 0.02
[ERROR]: Could not open 31150.temp
[ERROR]: std::bad_alloc has occurred in the ClusterSplitCommand class function createProcesses. This error indicates your computer is running out of memory. This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are running our 32bit version, your memory usage is limited to 4G. If you have more than 4G of RAM and are running a 64bit OS, using our 64bit version may resolve your issue. If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, http://www.mothur.org/wiki/Schloss_SOP. If you are uable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.% "


Is 131000 sequences too many? is this normal? Am I doing sth wrong?

any help would be very much appreciated

Cheers
Panos

PS: I have edited the headers of the fasta, the taxonomy and the count_table files (I shortened the headers) because if I hadnt done that then the files after the dist.seqs or the cluster.split would be more than 200-300gb and my pc was running out of space…

A couple of things…

  1. Your V4 region seems longer than ours. Are you following the Kozich protocol or are you sequencing through the indices and primers and into the fragment. In other words, does your sequence contain the indices, primers, and fragment of interest? The Kozich method only generates sequence data for the fragment of interest. If your sequence data includes the indices and primers, you’ll need to remove those by providing an oligos file. If you don’t then even if you have the same sequence in every sample, it will show up multiple times because the indices will be different.

  2. You need to use a larger cutoff than 0.05 in cluster/cluster.split if you want the final threshold of 0.03 (http://www.mothur.org/wiki/Frequently_asked_questions#Why_is_my_dataset_only_clustering_to_.22unique.22.3F)

Hi again

you were absolutely right
so apparently the primers and a small part of the adapters was in the sequences
I removed them all after the make.contigs command with cutadapt and then everything went smoothly
my alignment was 250bp and I didnt encounter any further problems

I changed my cluster.splt cutoff to 0.15 and the table looks fine! :slight_smile:

One quick question: when I classify, the taxonomy file I use is the silva.bacteria.tax but for the reference file I assumed that I should use the same fasta file I used for the alignment (the v4 region of silva bacteria) which I have removed the gaps by running degap.seqs… is that correct?

cheers
Panos

PS: thanks again, you already helped me a lot!

Great - and you should be able to remove the barcodes/primers using trim.seqs without having to leave mothur.

Pat