Stuck at clustering, it's been running for more than a week

Hi, this is Sumaiya.
I am using the MiSeq SOP for the first time, but I worked through the example and it went well.
I have 96 samples (48 x 2) from a NextSeq run. The total was 17 million sequences; after processing the improved sequences, the number of unique sequences was 626349, and everything went smoothly up to that point. I was also able to create a 50 GB dist file, which took 4 days. But when I tried to cluster, it was stuck for more than 7 days; it showed no error or anything, it just sat there until it got killed today.
Simultaneously, I created a new folder, copied everything except the dist file into it, and ran the cluster.split command with taxlevel=4 and cutoff=0.03. It started running very fast: it made 127 subdivisions, created the individual count.temp and dist.temp files, created a final.dist file (50 GB again), and within a few hours produced 115 of the 127 final.XXX.opti_mcc.list files. It then took 4 days to create 4 more final.XXX.opti_mcc.list files, after which it was stuck for days and got killed today. In the end it had created 119 of the 127 final.XXX.opti_mcc.list files, but not the combined final.opti_mcc.list file.
I am using a MacBook Pro with 16 GB RAM and 600 GB of available space. Reading previous posts, this looks like a RAM issue. So, if I now run the same cluster.split command on a computer with around 100 GB RAM, will everything be okay? Or should I process 48 samples instead of 96 so the numbers are smaller, and in that case could I run it on my MacBook? I guess the distance file would still be around 50 GB, though. I know many people have already faced this; any suggestions would be really helpful.
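For reference, the two clustering commands I ran were along these lines (file names follow the SOP's final.* naming; taxlevel is the cluster.split parameter for the taxonomic level used to split the dataset):

mothur > cluster(column=final.dist, count=final.count_table, cutoff=0.03)
mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cutoff=0.03)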
Thank you.

Hi Sumaiya,

Can you tell me what region you sequenced and what MiSeq chemistry you used? I suspect you sequenced something like V3-V4 or V4-V5 rather than V4 and might have a very high sequencing error rate. You might want to read this blog post…

Also, feel free to post to this thread the commands you ran to process the data, along with the output from summary.seqs along the way, to help us diagnose what else might be wrong.

Take care,

Thanks Pat.
The sequencing targeted the V4 region. About the sequencing, I know it was run on a NextSeq, but unfortunately I do not have any additional information at hand right now.
Anyway, after cluster and cluster.split were killed yesterday, I ran the make.shared command for ASVs and it completed within a minute. Just wanted to let you know.
I also have a question: while making contigs at the initial stage, can I change the default quality score? Is there a command or option to do that?
Really appreciate your help.

The commands are here:

mothur > make.contigs(file=stability.files)
mothur > summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	35	35	0	2	1
2.5%-tile:	1	291	291	0	4	423386
25%-tile:	1	292	292	0	4	4233852
Median:		1	292	292	0	5	8467704
75%-tile:	1	292	292	0	6	12701556
97.5%-tile:	1	301	301	3	6	16512022
Maximum:	1	602	602	145	301	16935407
Mean:		1	293	293	0	5

# of unique seqs:	16935407
total # of seqs:	16935407

It took 328 secs to summarize 16935407 sequences.

(As the 97.5%-tile was 301, I took maxlength=300.)

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table, maxambig=0, maxlength=300, maxhomop=8)
It took 77 secs to screen 16935407 sequences, removed 2775175.

mothur > summary.seqs(count=current)
mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta, count=stability.contigs.good.count_table)
mothur > summary.seqs(count=stability.trim.contigs.good.count_table)
mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11895, end=25318, keepdots=F)
mothur > rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta)
mothur > summary.seqs(fasta=silva.v4.fasta)
mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, start=1, end=13424)
mothur > summary.seqs(fasta=current, count=current)
mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
mothur > pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
mothur > chimera.vsearch(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
mothur > summary.seqs(fasta=current, count=current)

# of unique seqs:	626349
total # of seqs:	13451223

mothur > classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax)

It took 1943 secs to classify 626349 sequences.
It took 20 secs to create the summary file for 626349 sequences.

mothur > remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.count_table, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

mothur > summary.seqs(fasta=current, count=current)

mothur>rename.file(fasta=current, count=current, taxonomy=current, prefix=final)

mothur > dist.seqs(fasta=final.fasta, cutoff=0.03)

It took 245433 secs to find distances for 609942 sequences. 557644008 distances below cutoff 0.03.

Output File Names: 
/Users/sumaiyasaifur/Documents/mothur analysis_biodegradation/final.dist

mothur > cluster(column=final.dist, count=final.count_table)
Using 10 processors.

You did not set a cutoff, using 0.03.

Clustering /Users/sumaiyasaifur/Documents/mothur analysis_biodegradation/final.dist
and got killed.


mothur > classify.otu(list=final.asv.list, count=final.count_table, taxonomy=final.taxonomy, label=ASV)

Can you post the output of running this?

summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

I suspect the problem is that you used the 2x300 chemistry. This is a problem because (1) the quality drops off on this chemistry after 500 total nt sequenced (see the post I included before), (2) the V4 region is only 250 nt long, and (3) sequencing beyond the end of the fragment increases error rates further. Also, because you sequenced beyond the length of the amplicon, you likely have the primers, barcodes, and adapters included on your sequences. This will also inflate the number of unique reads: when the same sequence shows up in multiple samples, the differing barcodes keep it from being recognized as a duplicate across samples.

The biggest problem was using the 2x300 rather than the 2x250 chemistry. Aligning to a V4 reference that does not include the primers should remove the primers, barcodes, and adapters, but you'll still have the low sequence quality to deal with. You could also use overlap=T in make.contigs, but the error rate will still be a problem. I wouldn't trust ASVs, since each sequencing error will become its own ASV, and if you can get OTUs to cluster on a computer with enough RAM, you'll have a lot of spurious OTUs. Short of re-sequencing the data with the 2x250 chemistry, you could use a phylotype-based approach.
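The phylotype workflow would look roughly like this (a sketch assuming your final.* file names; label=1 corresponds to the genus level):

mothur > phylotype(taxonomy=final.taxonomy)
mothur > make.shared(list=final.tx.list, count=final.count_table, label=1)
mothur > classify.otu(list=final.tx.list, count=final.count_table, taxonomy=final.taxonomy, label=1)

Because phylotypes are binned by classification rather than by pairwise distances, this skips the dist.seqs/cluster steps entirely, so RAM should not be an issue.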

Sorry I don’t have better news.


Thank you for your insights. So, with the existing sequencing data, there is not much to be done, right? I also see that the quality drops after 250 nt; I was thinking of changing the max length to 295 instead of 300 in the screen.seqs command, though I am not sure it would make a difference.

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table, maxambig=0, maxlength=300, maxhomop=8)

For your question: sure, here is the output of that command.
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	1231	2	0	1	1
2.5%-tile:	1	13424	291	0	4	354006
25%-tile:	1	13424	292	0	4	3540059
Median:		1	13424	292	0	5	7080117
75%-tile:	1	13424	292	0	6	10620175
97.5%-tile:	1	13424	293	0	6	13806227
Maximum:	13422	13424	300	0	8	14160232
Mean:		16	13412	291	0	5

# of unique seqs:	1499742
total # of seqs:	14160232

The errors came in during the sequencing stage, and unfortunately that's what you've got now. If there really is no way to re-sequence the data, I'd use overlap=T and then pursue the phylotype-based approach, since it is less sensitive to sequencing errors.
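Concretely, the contig step would look something like this (the insert parameter is make.contigs' quality score threshold for resolving mismatches in the overlap; its default is 20, and the value 25 here is only an illustration):

mothur > make.contigs(file=stability.files, overlap=T, insert=25)
mothur > summary.seqs(fasta=current, count=current)

Raising insert makes the assembly stricter about low-quality disagreements between the two reads, which should cut down the number of spurious unique sequences.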


Thanks! I used overlap=T and added a quality score in make.contigs, and the number of unique sequences has decreased a lot now.

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.