Cluster.split problem in second step of clustering

I have tried to run the cluster.split command, but it never completes the clustering part. I have also tried running it in two parts, but I am still facing issues (even when I run it on an HPC).
These are the commands I ran:

chimera.vsearch(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.count_table)
classify.seqs(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)
remove.lineage(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)


Here is my output for summary:
Using 40 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	632	220	0	3	1
2.5%-tile:	1	669	252	0	3	36094
25%-tile:	1	669	252	0	4	360931
Median:	1	669	253	0	4	721862
75%-tile:	1	669	253	0	5	1082792
97.5%-tile:	1	669	253	0	6	1407629
Maximum:	2	669	272	0	8	1443722
Mean:	1	668	252	0	4

# of Seqs: 1443722

It took 10 secs to summarize 1443722 sequences.

After this, I am running cluster.split

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=36)

Using 36 processors.
Splitting the file…
Selecting sequences for group Enterobacterales (1 of 129)
Number of unique sequences: 153062

I get files of the final.0.count.temp and final.0.count.dist type, but in the second step of cluster.split it just keeps running endlessly. What should I do?


One thing that is striking to me is that you do not seem to be using the unique.seqs command, and your summary does not show the number of unique sequences. So it looks like you are clustering 1.5 million reads, which is a lot and will take forever even on a good cluster.
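To put the scale in perspective: distance calculation grows quadratically with the number of unique sequences. A rough back-of-envelope sketch (my own illustration, not mothur code):

```python
def pairwise_distances(n):
    """Number of unordered sequence pairs a full distance matrix must consider."""
    return n * (n - 1) // 2

# The single Enterobacterales group from the log above (153062 uniques)
# already implies on the order of 1e10 candidate pairs before any
# cutoff-based pruning, which is why the splitting step matters so much.
print(pairwise_distances(153_062))
```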

Best of success.

Oh, I just saw the names of your files, and you did run unique.seqs. It is strange that the number of uniques does not show up in the summary you posted.

Hi Alexandre,
Right after the summary, when I run cluster.split, it does show the number of unique sequences.
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=36)

Using 36 processors.
Splitting the file…
Selecting sequences for group Enterobacterales (1 of 129)
Number of unique sequences: 153062

Selected 4735881 sequences from final.count_table.

Calculating distances for group Enterobacterales (1 of 129):
bla bla bla…

I am sorry, I cannot help further. Sometimes, when my students get errors that are hard to solve, I tell them to delete all files generated by mothur and to start from a clean environment, regenerating all the files needed, just to make sure everything is in order. Also, we always use "current" in our pipeline instead of full file names, to make sure there is no mismatch with any older files. That's my only advice for now.

I am working with mothur 1.47, using 32 processors and 128 000 MB of memory, and I run cluster.split at tax level 6 (genus, for past reasons too long to explain) before clustering at 0.03. I have never had any problems with a new dataset I am currently analysing, which contains a final total of 52 million sequences with 430 000 uniques.

Best of success,

Is it possible that the data are 2x150 rather than 2x250 reads? If they are 2x150 I suspect the error rates will be quite high since the reads don’t fully overlap the V4 region.
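For context, the V4 amplicon is typically around 253 bp (a common figure, not a number from this thread), so the achievable overlap between paired reads can be estimated with simple arithmetic:

```python
def read_overlap(read_len, amplicon_len=253):
    """Bases covered by both reads of a pair spanning the amplicon."""
    return max(0, 2 * read_len - amplicon_len)

print(read_overlap(250))  # 247: with 2x250 reads, almost every base is read twice
print(read_overlap(150))  # 47: with 2x150 reads, most bases are covered only once,
                          # so sequencing errors cannot be corrected by the mate read
```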


Yes, my data contains 2x150 reads. But is there any workaround to improve the alignment and overcome the resulting issues?

Not really…and that’s why your clustering is taking forever (other post).

What I would do to "save" the experiment: I would use maxee in make.contigs to remove really bad reads, with the following options: deltaq=0, maxee=2 or 3 or even 5 (I would adjust maxee based on the number of uniques after pre.cluster and, of course, on the results for my positive and negative controls).
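For reference, maxee filters on expected errors, i.e. the sum of the per-base error probabilities implied by the Phred quality scores. A minimal sketch of the calculation (my own illustration, not mothur's code):

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities: P(error) = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)

# 150 bases at Q30 (error probability 0.001 each) give 0.15 expected errors,
# comfortably under a maxee=2 threshold; 150 bases at Q20 give 1.5.
print(expected_errors([30] * 150))
```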

Best of success.

Thank you so much for your support. Today, after the weekend, I checked the results of my experiment, and thankfully the process has completed; it did not crash this time. But I got 257877 OTUs in total from 1.5 million reads. Is this normal?

The process completed successfully. I am now confident that I can run this process with the resources available to me. I have one query regarding the final.opti_mcc.shared file.
Is the order of the samples preserved in this file? I mean, is the order of the samples the same as when we started the pipeline from make.contigs?
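You can also check the order directly: a .shared file is tab-separated with the columns label, Group, numOtus, followed by the per-OTU counts, so the sample order is just the Group column read top to bottom. A hypothetical helper (not a mothur command):

```python
def shared_sample_order(path):
    """Return sample (Group) names in the order they appear in a mothur .shared file."""
    order = []
    with open(path) as fh:
        fh.readline()  # skip the header line (label, Group, numOtus, Otu...)
        for line in fh:
            if line.strip():
                order.append(line.split("\t")[1])
    return order
```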

I am running my pipeline again on the same dataset to reduce the number of OTUs. When I used the parameters deltaq=0, maxee=2 with make.contigs, I got a blank obesasth.trim.contigs.fasta file. What does this mean, and what can I do about it?

Hello! It means your reads are not passing QC. Increase the maxee value until you get satisfactory results. Sorry, in my view there is no easy way out of it.


I just realized that the reason I got the blank file was not maxee, but the addition of the oligos file. This was the command I had given.
make.file(inputdir='/mnt/client_sharedfolder/obesity/empty/metagenafterbwtphixhg', type=fastq, prefix=obesasth)
make.contigs(file=current, maxambig=0, maxlength=275, maxhomop=8, maxee=2, deltaq=0, oligos=test.oligos)

Here is the content of my oligos file.

Can you suggest why this happened? I mean, why did I get zero reads in my contigs file when I used the oligos option?
Now that I have removed the oligos option, I am getting contigs in my obesasth.trim.contigs.fasta file.

I am not sure you are using the right primers. Another thing you can try is pdiffs=2, to allow some differences between your reads and the primers before a read is removed. Also, maybe your reads no longer have primers attached, and since the primers are never found, all reads are automatically deleted? In your earlier post you had a length of 253: was this with or without the oligos file?
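As a rough illustration of what pdiffs controls: the start of each read is compared against the primer, and the read is discarded once the mismatch count exceeds pdiffs. A toy sketch with a made-up primer (mothur's real matching also handles IUPAC ambiguity codes like Y and M):

```python
def primer_mismatches(read, primer):
    """Count mismatches between a primer and the start of a read."""
    return sum(1 for r, p in zip(read, primer) if r != p)

def passes_pdiffs(read, primer, pdiffs=2):
    """Keep the read only if the primer matches within pdiffs mismatches."""
    return primer_mismatches(read, primer) <= pdiffs

# One mismatch against the toy primer: kept with pdiffs=2, dropped with pdiffs=0.
print(passes_pdiffs("GTGACAGCAGCCGCGGTAA", "GTGCCAGCAGCCGCGGTAA", pdiffs=2))
```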

These are the standard primers (515F and 806R). I think the primers are not in my files, because I had run cutadapt before starting the mothur pipeline. Even cutadapt did not change anything; the 'before' and 'after' files were identical. So I think I can safely assume that the primers are not included in my reads.
The length of 253 that I got earlier was without any oligos file.

After running the mothur script yesterday, with the maxee=2 (and deltaq=0, and no oligo parameter), here is my summary after precluster:

Using 40 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	631	220	0	3	1
2.5%-tile:	1	669	252	0	3	848227
25%-tile:	1	669	252	0	4	8482261
Median:	1	669	253	0	4	16964522
75%-tile:	1	669	253	0	5	25446783
97.5%-tile:	1	669	254	0	6	33080817
Maximum:	2	669	270	0	8	33929043
Mean:	1	668	252	0	4

# of unique seqs: 2265207

total # of seqs: 33929043

It took 41 secs to summarize 33929043 sequences.

Output File Names:

The number of sequences is more than what I had when I did not set maxee=2 and deltaq=0. I don’t understand why this is so. What is the default value of maxee?

Hello! I am glad you found out what the problem was. I would finish the run and see whether, after chimera removal, I have fewer uniques than last time. Best of success!


Hello, I have figured out the primer issue. However, I am still stuck, because I don't know how to reduce the number of sequences I am getting. The number of sequences I get now is more than what I was getting earlier after chimera removal, so adding maxee=2 did not help. What is the default value of maxee?

Here is the summary I get after chimera removal and classify.seqs:


Using 40 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	631	220	0	3	1
2.5%-tile:	1	669	252	0	3	51134
25%-tile:	1	669	252	0	4	511337
Median:	1	669	253	0	4	1022673
75%-tile:	1	669	253	0	5	1534009
97.5%-tile:	1	669	253	0	6	1994211
Maximum:	2	669	270	0	8	2045344
Mean:	1	668	252	0	4

# of Seqs: 2045344

It took 13 secs to summarize 2045344 sequences.

Output File Names:


You can always try maxee=1 if you want. Another thing I found this morning, from tests I ran during the night, is that being more aggressive in the alignment step reduces my unique count and gives slightly better error rates. I am now using match=2, mismatch=-2 instead of the defaults. That is the most I will tinker with the original SOP, for now at least.
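To see why match=2, mismatch=-2 is "more aggressive": it doubles the score gap between a correct and an incorrect column, so marginal alignments score relatively worse. A toy scorer over already-aligned sequences (not mothur's aligner, which also handles gaps):

```python
def column_score(a, b, match=1, mismatch=-1):
    """Score two pre-aligned, equal-length sequences column by column."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

seq_a, seq_b = "ACGTACGT", "ACGTACGA"  # 7 matches, 1 mismatch
print(column_score(seq_a, seq_b))                        # default scoring: 7 - 1 = 6
print(column_score(seq_a, seq_b, match=2, mismatch=-2))  # aggressive: 14 - 2 = 12
```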

At some point you will have to decide how you want your data to look and choose the pipeline that best suits you. Downstream analysis can also help with decision making.

Best of success

Hello, I have some interesting results to share. In my last experiment, I had 1443722 sequences going into cluster.split, which gave me 25877 OTUs at taxlevel=4.
This time, after setting some parameters in make.contigs, I had 913717 sequences going into cluster.split and got 78715 OTUs at taxlevel=5.
Is this a realistic number and good to take further for downstream analysis?

Thank you

Morning! My advice: just try it!
