Clustering for a low-diversity, large dataset

I’ve previously run mothur successfully on large datasets (600+ samples) of full-length 16S amplicons using our institution’s supercomputer. However, I’m now running into a problem clustering ~700 infant oral samples. The job keeps timing out before completing the first taxonomic split.

I’ve divided cluster.split into a split step and a separate cluster step using cluster=f. This normally does the trick, but in these samples around half the sequences belong to a single family (Streptococcaceae). Because of this, the very first split runs for ages on Streptococcaceae and the job ends up timing out. We get a maximum wall time of 7 days on the supercomputer.

I’ve tried running at taxonomic levels 4 and 5, but both time out before completing the first taxonomic split (Lactobacillales and Streptococcaceae, respectively).

For context, we have ~9.5 million sequences, ~3.5 million of which are unique. Around 2 million of those unique sequences are Streptococcaceae, which is probably why we are stuck on this first taxon.
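
To give a rough sense of the scale involved, here is a back-of-the-envelope sketch in Python. It assumes, for illustration only, that clustering one taxon could require considering every pair of unique sequences within it; how much of that work mothur actually does is not something this sketch measures. The function name `pairwise_count` is hypothetical, and the counts are the approximate figures quoted above.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Approximate figures from the post (not exact counts).
total_unique = 3_500_000
strep_unique = 2_000_000

strep_pairs = pairwise_count(strep_unique)

# Upper bound on sequence pairs within the single Streptococcaceae split,
# under the all-pairs assumption stated above.
print(f"Streptococcaceae pairs: {strep_pairs:.2e}")  # 2.00e+12

# Fraction of all unique sequences that fall into this one taxon.
print(f"Share of unique sequences: {strep_unique / total_unique:.0%}")  # 57%
```

Under that assumption, the one family alone accounts for on the order of two trillion candidate pairs, which is consistent with the first split dominating the run time.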

Do you know of any workarounds for this? I don’t think that going to taxonomic level 6 will help, since these are likely all streptococci. Am I better off using dist.seqs followed by cluster?



Hi Lisa - what region are you sequencing and with what platform/chemistry? I worry you are falling prey to this…


We’re using PacBio Sequel II to sequence the full-length 16S gene. We’ve always had a high proportion of unique sequences using this method and have always needed to use cluster.split with cluster=f on large datasets. But this is the first time that a single taxon has held so many of the sequences.



