Clustering for a low diversity, large dataset

LisaS · July 26, 2023, 7:21am

Hi there,

I’ve successfully run mothur on large datasets (600+ samples) of full-length 16S amplicons before, using our institution’s supercomputer. However, I’m now running into a problem clustering ~700 infant oral samples. The job keeps timing out before completing the first taxonomic split.

I’ve divided cluster.split into a split step and a separate cluster step using cluster=f. This normally does the trick, but in these samples around half the taxa are a single family (Streptococcaceae). Because of this, the very first split runs for ages on Streptococcaceae and the job ends up timing out. We get a maximum of 7 days of wall time on the supercomputer.

I’ve tried running at taxa level 4 and 5, but both time out before completing the first taxonomic split (Lactobacillales and Streptococcaceae, respectively).

For context, we have ~9.5 million sequences, ~3.5 million of which are unique. Around 2 million of those unique sequences are Streptococcaceae, which is probably why we are stuck on this first taxon.
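
As a rough back-of-the-envelope check on why this one bin dominates the run time, here is a minimal Python sketch that counts the worst-case number of pairwise comparisons. It assumes every unique sequence in a bin is compared against every other one; the counts are the approximate figures above, and `n_pairs` is a hypothetical helper, not part of mothur.

```python
# Worst-case number of pairwise comparisons for a taxonomic bin.
# Assumption: all unique sequences within a bin are compared with each other.
# The sequence counts are the approximate figures quoted in the post.

def n_pairs(n_seqs: int) -> int:
    """Number of unordered pairs among n_seqs unique sequences."""
    return n_seqs * (n_seqs - 1) // 2

total_unique = 3_500_000   # ~3.5 million unique sequences overall
strep_unique = 2_000_000   # ~2 million unique Streptococcaceae sequences

strep_pairs = n_pairs(strep_unique)
all_pairs = n_pairs(total_unique)

print(f"Streptococcaceae bin: {strep_pairs:.3e} pairs")
print(f"Unsplit dataset:      {all_pairs:.3e} pairs")
print(f"Share of unsplit work in one bin: {strep_pairs / all_pairs:.1%}")
```

Under these assumptions the Streptococcaceae bin alone is on the order of 2 × 10^12 pairs, roughly a third of the work for the whole unsplit dataset, so splitting by taxonomy removes far less of the work here than it would for a more even community.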

Do you know of any workarounds for this? I don’t think that going to taxa level 6 will help since these are likely all Streptococci. Am I better off using dist.seqs followed by cluster?

Thanks,

Lisa.

pschloss · July 27, 2023, 5:30pm

Hi Lisa - what region are you sequencing and with what platform/chemistry? I worry you are falling prey to this…

Pat

LisaS · July 28, 2023, 1:12am

Hi Pat,

We’re using PacBio Sequel II to sequence the full-length 16S gene. We’ve always had a high proportion of unique sequences using this method and have always needed to use cluster.split with cluster=f on large datasets. But this is the first time that one taxon is holding so many of the sequences.

Thanks,

Lisa.

