cluster.split

Hi, I am trying to make OTUs from a MiSeq Illumina run with the cluster.split command, and I seem to be generating around 600GB of data. Is this normal?

I think I have a similar problem as you. I am currently running the pr.cluster step on MiSeq data which has been running for the past one week. Is this normal?
I am using a super computer with a memory of 256GB…it is currently using 186GB of memory and 32% of CPU usage…My computer has only 2 proccesors its an Intel® Xeon® Processor E5-2687W .
My question is, is it normal for this step to take that long…The filter.seq step before pre.cluster also showed me that the length of my filtered alignment was 9583, commands pasted below. Could this be the problem? should I have done something else before?

mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, processors=8, trump=.)

Using 8 processors.
Creating Filter…
Running Filter…

Length of filtered alignment: 9583
Number of columns removed: 3842
Length of the original alignment: 13425
Number of sequences used to construct filter: 1588813

Output File Names:
stability.filter
stability.trim.contigs.good.unique.good.filter.fasta

mothur > get.current()

Current files saved by mothur:
fasta=stability.trim.contigs.good.unique.good.filter.fasta
processors=8

Current working directory: G:\mothur\mothurGUI\

So getting large distance matrices like this typically happens when the data have a high error rate. With MiSeq data (and virtually all HiSeq data), this generally happens when the reads do not fully overlap. Without the full overlap you can’t get good denoising. Can you tell us what region you’re sequencing and how you’re running upstream commands?

We are sequencing the V4-region according to Caporaso et al. 2011, PNAS 108:4516-4522. The primers used are as follows. Forward primer: GTGCCAGCMGCCGCGGTAA
Reverse primer: GGACTACHVGGGTWTCTAAT
I am following the exact steps on the MothurGUI miseq SOP. I initially run the PCR.seq from E.coli with my primers to get the start and end position and got my start position as 11895 and end position as 25318, since the start and end position on the miseq SOP was only off by one primer on both ends I went with what was on the sop i.e. start: 11894 and end: 25319 I didn’t think this should make a big difference…does it? At the pre-cluster stage I set the difference at 1.
I was trying to figure out how to get the log file with all the steps I run on one file so that I can send that to you but I can only see the individual log files for each step. Would you like me to send this to you? The pre.cluster step is still running…its now been one week and one day.
Would appreciate any help
Thanks

A few questions…

Can you post the exact commands you are running?
How many samples are you sequencing?
Were the data sequenced by MiSeq, GAII, or HiSeq?

The data was sequenced on a MiSeq platform and I only have a total of 24 samples and from the summary.seqs total number of sequences are 5897685. Below are the steps I have done to the pre.cluster step…I tried to include as much information without overloading the post…
mothur > make.contigs(file=stability.files, processors=8)
Total of all groups is 5897685
mothur > summary.seqs(fasta=stability.trim.contigs.fasta, processors=8)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 149 149 0 3 1
2.5%-tile: 1 252 252 0 4 147443
25%-tile: 1 253 253 0 4 1474422
Median: 1 253 253 0 5 2948843
75%-tile: 1 253 253 0 5 4423264
97.5%-tile: 1 270 270 4 8 5750243
Maximum: 1 303 302 55 151 5897685
Mean: 1 253.803 253.803 0.434396 4.88913

of Seqs: 5897685

Output File Names:
stability.trim.contigs.summary
mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, optimize=none, maxambig=0, maxlength=275, processors=8)
It took 190 secs to screen 5897685 sequences
mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta, processors=8)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 151 151 0 3 1
2.5%-tile: 1 252 252 0 4 123311
25%-tile: 1 253 253 0 4 1233107
Median: 1 253 253 0 5 2466213
75%-tile: 1 253 253 0 5 3699319
97.5%-tile: 1 254 254 0 8 4809114
Maximum: 1 275 275 0 151 4932424
Mean: 1 253.227 253.227 0 4.87394

of Seqs: 4932424

mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)
mothur > count.seqs(group=stability.contigs.good.groups, name=stability.trim.contigs.good.names, processors=8)
Using 8 processors.
It took 85 secs to create a table for 4932424 sequences
mothur > system(command=rename silva.bacteria.pcr.fasta silva.v4.fasta)
mothur > summary.seqs(fasta=G:\lilian mothur\mothurGUI\silva.v4.fasta, processors=8)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13424 270 0 3 1
2.5%-tile: 1 13425 292 0 4 374
25%-tile: 1 13425 293 0 4 3740
Median: 1 13425 293 0 4 7479
75%-tile: 1 13425 293 0 5 11218
97.5%-tile: 1 13425 294 1 6 14583
Maximum: 3 13425 351 5 9 14956
Mean: 1.00074 13425 292.977 0.0573014 4.57014

of Seqs: 14956

mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=G:\lilian mothur\mothurGUI\silva.v4.fasta, processors=8)
Some of you sequences generated alignments that eliminated too many bases, a list is provided in stability.trim.contigs.good.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the reverse compliment as well.It took 1517 secs to align 1608184 sequences
mothur > summary.seqs(count=stability.trim.contigs.good.count_table, fasta=stability.trim.contigs.good.unique.align, processors=8)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1968 11550 252 0 4 123311
25%-tile: 1968 11550 253 0 4 1233107
Median: 1968 11550 253 0 5 2466213
75%-tile: 1968 11550 253 0 5 3699319
97.5%-tile: 1968 11550 254 0 8 4809114
Maximum: 13425 13425 275 0 151 4932424
Mean: 1973.69 11560.6 253.054 0 4.87077

of unique seqs: 1608184

mothur > screen.seqs(count=stability.trim.contigs.good.count_table, fasta=stability.trim.contigs.good.unique.align, optimize=none, start=1968, end=11550, maxhomop=8, processors=8)
mothur > summary.seqs(count=stability.trim.contigs.good.good.count_table, fasta=stability.trim.contigs.good.unique.good.align, processors=8)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 11550 232 0 3 1
2.5%-tile: 1968 11550 252 0 4 122371
25%-tile: 1968 11550 253 0 4 1223709
Median: 1968 11550 253 0 5 2447418
75%-tile: 1968 11550 253 0 5 3671126
97.5%-tile: 1968 11550 254 0 8 4772464
Maximum: 1968 13425 275 0 8 4894834
Mean: 1967.99 11563.5 253.277 0 4.86907

of unique seqs: 1588813

total # of seqs: 4894834
mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, processors=8, trump=.)
Length of filtered alignment: 9583
Number of columns removed: 3842
Length of the original alignment: 13425
Number of sequences used to construct filter: 1588813
mothur > unique.seqs(count=stability.trim.contigs.good.good.count_table, fasta=stability.trim.contigs.good.unique.good.filter.fasta)
mothur > pre.cluster(count=stability.trim.contigs.good.unique.good.filter.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, processors=8)

Also, I’m wondering whether you have 2x150 bp or 2x250 bp reads. It looks like 2x150. If this is the case, then I’m afraid one problem is likely that because the reads don’t fully overlap (v4 is 250 bp) then you won’t get very good desnoising (see our Kozich paper). The effect this has on the analysis is a higher error rate and more unique sequences. Give diffs=2 a try and then run cluster.split. If need be split at a deeper taxonomic level (genus?) and let it rip. This might just be the best we can do for an OTU-based approach. Alternatively, for these high error rate datasets, we suggest using the phylotype-based approach.

If you do have 2x250 or 2x30 can you send the output of:

summary.seqs(count=stability.trim.contigs.good.unique.good.filter.count_table, fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta)

and then can you try pre.cluster with diffs=2?


Pat

So i have a similar problem but my MiSeq reads are 2 x 300.

I have been trying to get my cluster.split command to run for over a week. It runs but i never seem to have enough memory. How do i work out how much memory i need? And is there another way? Should I alter the diff=4 to diff=2?

Thanks

Jaz

@Jaz-

What sequencing kit are you using and how long is your region?

Pat

Hi all,

I’m reviving this thread because I have a similar problem but didn’t see a conclusion here.

I’ve sequenced the V4 region using primers Forward GTGCCAGCMGCCGCGGTAA and reverse GGACTACHVGGGTWTCTAAT, and a MiSeq instrument.
There are 86 samples in total.

I’ve followed the MiSeq SOP and the problems are the huge distance matrix I get
(349G), the cluster command only giving me unique sequences and my shared file which is full of zeros.
I think I have 2x250bp reads (I checked the contigs.report file and the overlap length of most of my sequences is 246)

Right now I’m trying using cluster.split, with only one processor, and it’s been running since Saturday.

In my case, is it also a problem of high error rates in my data?

I’ve processed the same type of samples before, using almost identical techniques and sent them to the same sequencing centre, analysed them with
Mothur and it all worked perfectly. What is wrong now? :cry:

Below are the commands I used:


mothur > make.contigs(file=stability_svet3.files, processors=5)
mothur > summary.seqs(fasta=stability_svet3.trim.contigs.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 18 18 0 1 1
2.5%-tile: 1 251 251 0 5 390404
25%-tile: 1 253 253 0 5 3904031
Median: 1 253 253 0 5 7808062
75%-tile: 1 253 253 0 5 11712092
97.5%-tile: 1 253 253 4 6 15225719
Maximum: 1 501 500 115 250 15616122
Mean: 1 254.235 254.235 0.435277 5.08956

of Seqs: 15616122

mothur > screen.seqs(fasta=stability_svet3.trim.contigs.fasta, group=stability_svet3.contigs.groups, summary=stability_svet3.trim.contigs.summary, maxambig=0, maxhomop=8, minlength=250, maxlength=275)
mothur > summary.seqs()
Using stability_svet3.trim.contigs.good.fasta as input file for the fasta parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 5 318072
25%-tile: 1 253 253 0 5 3180717
Median: 1 253 253 0 5 6361434
75%-tile: 1 253 253 0 5 9542151
97.5%-tile: 1 253 253 0 6 12404796
Maximum: 1 276 275 0 8 12722867
Mean: 1 252.91 252.91 0 5.08635

of Seqs: 12722867

mothur > unique.seqs(fasta=stability_svet3.trim.contigs.good.fasta)
mothur > count.seqs(name=stability_svet3.trim.contigs.good.names, group=stability_svet3.contigs.good.groups)
mothur > summary.seqs(count=stability_svet3.trim.contigs.good.count_table)
Using stability_svet3.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 5 318072
25%-tile: 1 253 253 0 5 3180717
Median: 1 253 253 0 5 6361434
75%-tile: 1 253 253 0 5 9542151
97.5%-tile: 1 253 253 0 6 12404796
Maximum: 1 276 275 0 8 12722867
Mean: 1 252.91 252.91 0 5.08635

of unique seqs: 303444

total # of seqs: 12722867
mothur > pcr.seqs(fasta=./silva.bacteria/silva.bacteria.fasta, start=11894, end=25319, keepdots=F, processors=5)
mothur > summary.seqs(fasta=silva.v4.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13424 270 0 3 1
2.5%-tile: 1 13425 292 0 4 374
25%-tile: 1 13425 293 0 4 3740
Median: 1 13425 293 0 4 7479
75%-tile: 1 13425 293 0 5 11218
97.5%-tile: 1 13425 294 1 6 14583
Maximum: 3 13425 351 5 9 14956
Mean: 1.00074 13425 292.977 0.0573014 4.57014

of Seqs: 14956

mothur > align.seqs(fasta=stability_svet3.trim.contigs.good.unique.fasta, reference=silva.v4.fasta, flip=t)
mothur > summary.seqs(fasta=stability_svet3.trim.contigs.good.unique.align, count=stability_svet3.trim.contigs.good.count_table)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1968 11549 252 0 5 318072
25%-tile: 1968 11550 253 0 5 3180717
Median: 1968 11550 253 0 5 6361434
75%-tile: 1968 11550 253 0 5 9542151
97.5%-tile: 1968 11550 253 0 6 12404796
Maximum: 13425 13425 274 0 8 12722867
Mean: 1971 11548.4 252.781 0 5.08484

of unique seqs: 303444

total # of seqs: 12722867
mothur > screen.seqs(fasta=stability_svet3.trim.contigs.good.unique.align, count=stability_svet3.trim.contigs.good.count_table, summary=stability_svet3.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 11550 250 0 3 1
2.5%-tile: 1968 11550 253 0 5 293816
25%-tile: 1968 11550 253 0 5 2938151
Median: 1968 11550 253 0 5 5876302
75%-tile: 1968 11550 253 0 5 8814452
97.5%-tile: 1968 11550 253 0 6 11458787
Maximum: 1968 13425 274 0 8 11752602
Mean: 1968 11550 252.995 0 5.08566

of unique seqs: 209471

total # of seqs: 11752602
mothur > filter.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.align, vertical=T, trump=.)
mothur > unique.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.filter.fasta, count=stability_svet3.trim.contigs.good.good.count_table)
mothur > summary.seqs(fasta=current)
Using stability_svet3.trim.contigs.good.unique.good.filter.unique.fasta as input file for the fasta parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 492 231 0 3 1
2.5%-tile: 1 492 252 0 4 5236
25%-tile: 1 492 253 0 5 52354
Median: 1 492 253 0 5 104708
75%-tile: 1 492 253 0 5 157061
97.5%-tile: 1 492 253 0 6 204179
Maximum: 1 492 266 0 8 209414
Mean: 1 492 252.937 0 5.0612

of Seqs: 209414

mothur > pre.cluster(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.fasta, count=stability_svet3.trim.contigs.good.unique.good.filter.count_table, diffs=2)
mothur > summary.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.count_table)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 492 231 0 3 1
2.5%-tile: 1 492 253 0 5 293816
25%-tile: 1 492 253 0 5 2938151
Median: 1 492 253 0 5 5876302
75%-tile: 1 492 253 0 5 8814452
97.5%-tile: 1 492 253 0 6 11458787
Maximum: 1 492 266 0 8 11752602
Mean: 1 492 252.999 0 5.08655

of unique seqs: 100222

total # of seqs: 11752602
Output File Names:
stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.summary
mothur > screen.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.count_table, minlength=253)
mothur > summary.seqs(fasta=current, count=current)
Using stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.count_table as input file for the count parameter.
Using stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.fasta as input file for the fasta parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 492 253 0 3 1
2.5%-tile: 1 492 253 0 5 293598
25%-tile: 1 492 253 0 5 2935979
Median: 1 492 253 0 5 5871957
75%-tile: 1 492 253 0 5 8807935
97.5%-tile: 1 492 253 0 6 11450315
Maximum: 1 492 266 0 8 11743912
Mean: 1 492 253 0 5.08659

of unique seqs: 94122

total # of seqs: 11743912
mothur > chimera.uchime(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.fasta, count=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.count_table, dereplicate=t, reference=self)
mothur > remove.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.fasta, accnos=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.uchime.accnos)
Removed 2111 sequences from your fasta file.
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 492 253 0 3 1
2.5%-tile: 1 492 253 0 5 293148
25%-tile: 1 492 253 0 5 2931474
Median: 1 492 253 0 5 5862947
75%-tile: 1 492 253 0 5 8794420
97.5%-tile: 1 492 253 0 6 11432745
Maximum: 1 492 266 0 8 11725892
Mean: 1 492 253 0 5.08623

of unique seqs: 92011

total # of seqs: 11725892
mothur > classify.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.fasta, count=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.uchime.pick.count_table, reference=./silva.bacteria/trainset9_032012.pds.fasta, taxonomy=./silva.bacteria/trainset9_032012.pds.tax, cutoff=80)
mothur > remove.lineage(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.fasta, count=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.uchime.pick.count_table, taxonomy=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 492 253 0 3 1
2.5%-tile: 1 492 253 0 5 293145
25%-tile: 1 492 253 0 5 2931449
Median: 1 492 253 0 5 5862897
75%-tile: 1 492 253 0 5 8794345
97.5%-tile: 1 492 253 0 6 11432649
Maximum: 1 492 266 0 8 11725793
Mean: 1 492 253 0 5.08623

of unique seqs: 91975

total # of seqs: 11725793
mothur > dist.seqs(fasta=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.fasta, cutoff=0.20)
mothur > cluster(column=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.dist, count=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.uchime.pick.count_table)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


Output File Names:
stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.an.unique_list.list
mothur > make.shared(list=stability_svet3.trim.contigs.good.unique.good.filter.unique.precluster.good.pick.an.unique_list.list, count=samples.trim.contigs.good.unique.good.filter.unique.precluster.good.uchime.pick.count_table, label=0.03)

http://www.mothur.org/wiki/Frequently_asked_questions#Why_is_my_dataset_only_clustering_to_.22unique.22.3F

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/