I’ve successfully run mothur on large datasets (600+ samples) of full-length 16S amplicons before, using our institution’s supercomputer. However, I’m now running into a problem clustering ~700 infant oral samples: the job keeps timing out before it completes the first taxonomic split.
I’ve divided the cluster.split run into a split step (using cluster=f) and a separate cluster step. This normally does the trick, but in these samples around half the sequences belong to a single family (Streptococcaceae). Because of this, the very first split runs for ages on Streptococcaceae and the job ends up timing out. Our maximum wall time on the supercomputer is 7 days.
I’ve tried running at taxlevel 4 and 5, but both time out before completing the first taxonomic split (Lactobacillales and Streptococcaceae, respectively).
For context, we have ~9.5 million sequences, ~3.5 million of which are unique. Around 2 million of those unique sequences are Streptococcaceae, which is probably why we are stuck on this first taxon.
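As a rough back-of-the-envelope check on why this one split is so heavy, here is the number of sequence pairs involved. This is my own arithmetic, not something mothur reports, and it assumes every unique sequence in a split has to be compared against every other one (a naive all-against-all count). Under that assumption the Streptococcaceae split alone is on the order of 2 × 10^12 pairs.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Approximate counts taken from the numbers quoted above.
all_unique = 3_500_000      # unique sequences in the whole dataset
strep_unique = 2_000_000    # unique sequences assigned to Streptococcaceae

# Assumption: a naive all-against-all comparison within each group.
strep_pairs = pairwise_count(strep_unique)
all_pairs = pairwise_count(all_unique)

print(f"Streptococcaceae split: {strep_pairs:.3e} pairs")
print(f"No split at all:        {all_pairs:.3e} pairs")
```

So even with the taxonomic split, the largest bin still carries roughly a third of the pairs that an unsplit run would have.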
Do you know of any workarounds for this? I don’t think that going to taxlevel 6 will help, since these are likely all Streptococci. Am I better off using dist.seqs followed by cluster?