Use cluster.split on MiSeq data


I am having trouble getting cluster.split to finish on my data set.

I am currently processing my MiSeq data in mothur: 105 samples, about 12M reads. After all of the preprocessing steps I have about 672,000 unique reads, representing about 6M reads in total. cluster.split has not finished yet, but it has already used about 2.4 TB of storage and a large amount of memory (>100 GB). Is this normal? Because it needs so much memory, my job got killed. Is there any way to get cluster.split to run to completion?

Any suggestions will be appreciated. I need to work on this data.

Thank you so much.

I am also facing this problem! Any response will be appreciated!

It is now stuck at this status: “Clustering mothur/data/fileg35alignfilter.unique.pick.phylip.column.dist.0.temp”. My data contain about 380,000 sequences. It has been at this status for more than 36 h. Is that normal?

Thank you!

What region are you sequencing and are you using mothur v 1.30? What taxonomic level are you using in cluster.split? You might as well save your time and kill the process; it will never finish.

We sequenced V4, and I am using mothur v.1.30.2. Following the MiSeq SOP, I used taxlevel=4.
Can I split my samples and run each of them through mothur separately?


Sorry for all the questions - do you know when the sequencing was done, what the cluster density was and what %phix was used?

No worries. I am happy to give more information if it helps to solve the problem. I do not know what the cluster density was or what %phix was used. I will find out. Can I ask why that information is important? After making contigs, we have 109022 reads for each sample.

I ran cluster.split. It got through the distance calculation, which produced 46 files, and then ran out of RAM during clustering. The file named XXX.0.dist is 1.5T. As the file number increases, the size of the distance files decreases. It seems that when mothur does the clustering, it loads the distance file into RAM. Since the distance files are huge, it did not get through.
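For a rough sense of scale, here is a minimal Python sketch of why the distance files get so large. It only counts the pairs that would exist if every unique sequence were compared with every other one. The `bytes_per_row` and `fraction_kept` values are illustrative assumptions, not numbers reported by mothur, and the function names are hypothetical.

```python
def pairwise_count(n_unique):
    """Number of distinct sequence pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


def estimate_dist_bytes(n_unique, fraction_kept=1.0, bytes_per_row=30):
    """Very rough size of a column-format distance file.

    fraction_kept: assumed share of pairs that are actually written
                   (pairs above the cutoff are not kept).
    bytes_per_row: assumed average length of one "nameA nameB dist" line.
    """
    return int(pairwise_count(n_unique) * fraction_kept * bytes_per_row)


if __name__ == "__main__":
    n = 672108  # unique reads reported in this thread
    print("pairs if all sequences are compared:", pairwise_count(n))
    # Assumed: 10% of pairs kept, 30 bytes per row.
    print("bytes at 10% kept:", estimate_dist_bytes(n, fraction_kept=0.10))
```

With 672108 unique reads the upper bound is about 2.3e11 pairs, so a distance file in the terabyte range is plausible even if only a modest fraction of the pairs is written out.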

Many thanks

Depending on what MiSeq software version was used for the sequencing, an inadequate amount of PhiX combined with too high a cluster density will result in an excessive number of sequencing errors leading to a large number of unique sequences like you are seeing.

As I understand it, the large number of unique reads makes the clustering step harder, so it cannot be done with a reasonable amount of RAM. The sequencing step can introduce errors, so we may have to filter our reads more aggressively. In our data, 672108 unique reads went into the cluster.split command. Since I have no experience analyzing MiSeq 16S rRNA data, can I ask what the maximum number of unique reads is that mothur can handle with a reasonable amount of RAM?

Thank you for the advice. I will take a close look at our data and check whether the unique reads could be caused by sequencing errors. Can I use FastQC to check for the sequencing errors you mentioned? Our reads are 150 nt long, paired-end, and the overlap is about 50 nt, so the forward and reverse reads each have about 100 nt that do not overlap. I looked through the mothur MiSeq SOP, and after making contigs it seems the screening step only looks at the region where the forward and reverse reads overlap. Is that true? If so, I may need to do quality control before I put my reads through the mothur pipeline.
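The overlap arithmetic in the post above can be sketched in a few lines of Python. The 250 nt fragment length is an assumption inferred from the poster's numbers (2 x 150 nt reads with about 50 nt of overlap). The function names are hypothetical and this is not part of mothur.

```python
def overlap_length(read_len, fragment_len):
    """Bases covered by both reads of a pair (0 if the reads never meet)."""
    return max(0, min(fragment_len, 2 * read_len - fragment_len))


def single_read_bases(read_len, fragment_len):
    """Bases covered by only one read, where a mismatch cannot be detected."""
    covered = min(fragment_len, 2 * read_len)
    return covered - overlap_length(read_len, fragment_len)


if __name__ == "__main__":
    # Assumed fragment length of 250 nt, read length of 150 nt.
    print(overlap_length(150, 250), single_read_bases(150, 250))
```

Under that assumption only 50 of the 250 positions are read twice, and the remaining 200 (100 from each read) rely on a single base call.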

Thanks for your patience. I really appreciate your help.

Here is the information about sequencing details:

30% PhiX was used, and the cluster density was 1086 K/mm2.



What version of the Real Time Analysis software was used?

Hi Pat,
I am also having the same problem. My data set has about 1.2M sequences, with over 800,000 unique sequences. Every time I try to run cluster.split the computer crashes after a while, even when I try to subsample the data. The analysis has been running on a computer with 8 processors for almost a week now, and it is still not finished.
What would be the consequences of subsampling the data after the screen.seqs command in your SOP?

The version of the RTA software used for this amplicon pool was 1.17.28, and the MCS version was 2.2.0.


Thanks for the information. So more than 800 K/mm2 is high and generally results in higher error rates. Also, you mention that you are using paired 150 bp reads and that the region is 200 bp so you only overlap 50 bp. We have found that if the reads do not fully overlap (full overlap would be, e.g., 2x250 reads to sequence the 253 bp V4 region) then your error rates will be considerably higher. The make.contigs command will align the two reads to each other, and if there is a mismatch between the two reads it will use the base that has a quality score at least 6 points higher. If the difference is less than 6, then an N is inserted. In the non-overlapping region there isn’t much one can do. For next time, I would use the 2x250 bp kit.
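The mismatch rule described above can be illustrated with a minimal Python sketch. This is not mothur's implementation: it assumes the two reads are already aligned to the same length with no gaps, and the function names and the `min_delta` parameter are hypothetical.

```python
def resolve_base(fwd_base, fwd_q, rev_base, rev_q, min_delta=6):
    """Pick the consensus base at one aligned position."""
    if fwd_base == rev_base:
        return fwd_base
    # On a mismatch, trust a base only if it beats the other by min_delta.
    if fwd_q - rev_q >= min_delta:
        return fwd_base
    if rev_q - fwd_q >= min_delta:
        return rev_base
    return "N"


def consensus(fwd, fwd_quals, rev, rev_quals, min_delta=6):
    """Consensus over two equal-length, gap-free aligned reads."""
    if not (len(fwd) == len(rev) == len(fwd_quals) == len(rev_quals)):
        raise ValueError("reads and quality lists must have equal length")
    return "".join(
        resolve_base(fb, fq, rb, rq, min_delta)
        for fb, fq, rb, rq in zip(fwd, fwd_quals, rev, rev_quals)
    )


if __name__ == "__main__":
    # Third position disagrees (G vs T); quality differs by only 5.
    print(consensus("ACGT", [30, 30, 30, 30], "ACTT", [30, 30, 35, 30]))
```

Positions covered by only one read never reach this comparison, which is why errors in the non-overlapping region pass through unchanged.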

For now, you might try going to level 5 (family) or 6 (genus) to do the split. If that doesn’t work, then I’d recommend just working with phylotype data.


costamc, please start a separate thread and supply the information that ning provided.

Thanks Pat. I now know what might have gone wrong with my data. I will try your suggestions. After this discussion, I have a better understanding of my data and of how mothur processes MiSeq data.

Thank you so much.