Clustering for a low-diversity, large dataset

Hi there,

I’ve previously run mothur successfully on large datasets (600+ samples) of full-length 16S amplicons using our institution’s supercomputer. However, I’m now running into a problem clustering ~700 infant oral samples: the job keeps timing out before completing the first taxonomic split.

I’ve divided cluster.split into a split step and a separate cluster step using cluster=f. This normally does the trick, but in these samples around half of the sequences belong to a single family (Streptococcaceae). Because of this, the very first split runs for ages on Streptococcaceae and the job ends up timing out. We get a maximum of 7 days of wall time on the supercomputer.

I’ve tried running at taxonomic levels 4 and 5, but both time out before completing the first taxonomic split (Lactobacillales and Streptococcaceae, respectively).

For context, we have ~9.5 million sequences, ~3.5 million of which are unique. Around 2 million of those unique sequences are Streptococcaceae, which is probably why we are stuck on this first taxon.
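
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only counts the possible sequence pairs within a group, n(n-1)/2, using the approximate figures above. It is an upper bound on the pairwise work, not a model of mothur's actual runtime or of how many distances it keeps after applying a cutoff. The function name is just for illustration.

```python
def pairwise_comparisons(n_unique: int) -> int:
    """Number of distinct sequence pairs among n_unique sequences: n(n-1)/2."""
    return n_unique * (n_unique - 1) // 2


# Approximate counts quoted in this post (assumed round numbers).
strep_unique = 2_000_000   # unique Streptococcaceae sequences
all_unique = 3_500_000     # all unique sequences in the dataset

strep_pairs = pairwise_comparisons(strep_unique)
all_pairs = pairwise_comparisons(all_unique)

print(f"Streptococcaceae bin: {strep_pairs:,} possible pairs")
print(f"Unsplit dataset:      {all_pairs:,} possible pairs")
print(f"Share in the one bin: {strep_pairs / all_pairs:.1%}")
```

So the single Streptococcaceae bin holds about 2 trillion possible pairs, roughly a third of what the whole unsplit dataset would have, which is why splitting by taxonomy buys so little here.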

Do you know of any workarounds for this? I don’t think that going to taxonomic level 6 will help, since these are likely all Streptococci. Am I better off using dist.seqs followed by cluster?



Hi Lisa - what region are you sequencing and with what platform/chemistry? I worry you are falling prey to this…


Hi Pat,

We’re using a PacBio Sequel II to sequence the full-length 16S gene. We’ve always had a high proportion of unique sequences with this method and have always needed to use cluster.split with cluster=f on large datasets. But this is the first time that a single taxon has held so many of the sequences.


