Cluster.split – limit of sequences it can handle?

Hi Pat and Sarah,

Following the MiSeq SOP, I’m trying to cluster 2,160,000 unique sequences (13 million sequences in total, amplicon size 350 bp) from 219 samples. I’m running the cluster.split command (taxlevel=4, cutoff=0.15, without “large=T”) on 64 cores with 400G RAM (of which it is using 100G). The program has been running for 12 days and I’m a bit worried it won’t finish.

Hence my question: is there a maximum number of sequences that cluster.split (or mothur in general) can handle, and is there anything I could do to speed up the process? We plan to analyse even more samples in the future…

My apologies if this has already been answered on the forum. I realise there have been a few posts on this subject, but I couldn’t find one that fits the above.

Thanks!
Miriam

It will probably continue running for a very long time. You can try using taxlevel=5 or 6.

The problem is that you have 350 bp amplicons, which indicates that your reads do not fully overlap with each other and that you likely have a high error rate. This will inflate the number of unique sequences you have (as well as the number of OTUs and the distance between samples). I strongly encourage people to use the v2 chemistry and sequence the V4 region so that their data can be denoised properly.
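To put rough numbers on why the unique count and the split level matter: the number of sequence pairs grows quadratically with the number of unique sequences, and splitting into taxonomic bins means only pairs within the same bin are compared. The sketch below is plain arithmetic, not mothur code. The bin layout (100 equal bins) is purely hypothetical, since real bins are uneven and a few large ones usually dominate the run time, and the pair counts are upper bounds on the work rather than measured figures.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n unique sequences."""
    return n * (n - 1) // 2


def split_pairwise_count(bin_sizes) -> int:
    """Total pairs when sequences are only compared within their own bin."""
    return sum(pairwise_count(n) for n in bin_sizes)


# Unique sequence count taken from the original post.
total_uniques = 2_160_000
all_pairs = pairwise_count(total_uniques)

# HYPOTHETICAL bin layout: the same uniques spread evenly over 100 bins.
# Real taxonomic bins will be far less even than this.
hypothetical_bins = [total_uniques // 100] * 100
split_pairs = split_pairwise_count(hypothetical_bins)

print(f"pairs without splitting: {all_pairs:,}")
print(f"pairs with 100 equal bins: {split_pairs:,}")
print(f"reduction factor: {all_pairs / split_pairs:.1f}x")
```

A deeper taxlevel generally gives more, smaller bins, which is why moving from 4 to 5 or 6 can help; halving the number of uniques in a bin cuts that bin's pair count by roughly four.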

Pat

Thanks Pat!

We used the v2 and v3 chemistry and covered the V4/V5 region.

I’ll try taxlevel 5 or 6. If that fails, would it be possible, while not ideal, to remove rare/singleton sequences before cluster.split, e.g. by using split.abund (or remove.rare?)? I’m aware that not all erroneous sequences are rare and not all rare sequences are erroneous, but it would reduce spurious OTUs given our likely high error rate.
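As an illustration of what such a filter does to the numbers, here is a minimal sketch in plain Python. It is not how split.abund is implemented; the sequence names and abundances are made up, and treating an abundance at or below the cutoff as "rare" is an assumption of this sketch.

```python
def drop_rare(counts: dict[str, int], cutoff: int = 1) -> dict[str, int]:
    """Keep unique sequences whose total abundance is greater than cutoff.

    Assumption of this sketch: abundance <= cutoff counts as rare,
    so cutoff=1 removes singletons only.
    """
    return {name: n for name, n in counts.items() if n > cutoff}


# Made-up abundances per unique sequence (illustrative only).
toy_counts = {"seqA": 5120, "seqB": 37, "seqC": 2, "seqD": 1, "seqE": 1}

kept = drop_rare(toy_counts, cutoff=1)
reads_lost = sum(toy_counts.values()) - sum(kept.values())

print(f"uniques: {len(toy_counts)} -> {len(kept)}")
print(f"reads removed: {reads_lost}")
```

Because singletons usually make up a large share of the uniques but a small share of the reads, a filter like this tends to shrink the clustering problem much more than it shrinks the dataset.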

Thanks!
Miriam

The v3 chemistry has been a disaster. We’re still advocating sequencing the V4 region with the v2 chemistry.

Pat