Issues with cluster command

Hello,

I’ve been having difficulty getting my sequences to cluster. I’m working with 16S Illumina sequences, and I have a large number of uniques immediately before making my distance matrix (249,539), though not more than I’ve seen other users here report. I’ve tried running both cluster and cluster.split (splitting using the distance matrix), and neither seems to work. With cluster, mothur runs until the .list file hits a certain size and then simply stops (I’ve tried this several times and it always stops in the same place). With cluster.split, if I set large=T it generates temporary .dist files right up until I exhaust the resources allocated to the job, but it never progresses beyond splitting the file.

Here are both commands:
cluster(column=.dist, cutoff=0.1, hard=T, method=average, name=.names)
cluster.split(column=.dist, cutoff=0.1, hard=T, large=T, method=average, name=.names, processors=24)

I never get mothur errors; I just run out of walltime on our servers. Please let me know if you’ve encountered this issue before and, if so, whether there’s a workaround.

Thank you!

What commands are you running?

fastq.info
trim.seqs
unique.seqs
split.abund (with cutoff=1)
align.seqs (with silva.gold.fasta as a reference)
screen.seqs
filter.seqs
chimera.uchime (using silva.gold.fasta as a reference)
remove.seqs (removing chimeras)
sub.sample (to 120,000 sequences per group)
dist.seqs
cluster
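
For reference, the last two steps look roughly like this (the file names here are placeholders for my actual outputs; the cutoff on dist.seqs keeps the matrix from storing distances larger than anything I’d cluster at):

dist.seqs(fasta=final.fasta, cutoff=0.20, processors=24)
cluster(column=final.dist, cutoff=0.1, hard=T, method=average, name=final.names)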

As far as I can tell, everything works normally (albeit using a lot of resources) up until the cluster step. As a note, I’m using mothur 1.27.0 because 1.28.0 isn’t available on our servers. I even seem to be creating the distance matrix without problems; it’s just huge. I’m wondering if the size of the distance matrix is what’s causing the cluster step to fail. Is this a known issue?

The problem is that you are using up all the memory. I’d remove the split.abund line and add unique.seqs and pre.cluster after filter.seqs as you would in the 454 SOP.
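
Roughly like this (file names are placeholders for your own outputs; diffs=2 follows the SOP’s rule of thumb of about 1 difference per 100 bp for ~250 bp reads):

unique.seqs(fasta=your.filter.fasta, name=your.names)
pre.cluster(fasta=your.filter.unique.fasta, name=your.filter.names, group=your.groups, diffs=2)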

Pat

I tried adding in a pre.cluster step (clustering within groups), and it did cut down the number of unique sequences I have just before making the distance matrix (to ~200K). I ended up with a ~22 GB distance matrix (with a cutoff of 0.2), but the cluster step still fails in the same way. For what it’s worth, I tried subsampling down to 20K sequences per sample, got a distance matrix of about 4 GB, and can only get cluster to finish if I lower the cutoff to 0.03, which is useless: the cutoff drops to 0.001 during clustering, so each OTU ends up representing a single unique sequence.

Searching the forums for other people with this problem, it seems that users with 60+ GB distance matrices run into trouble, but I didn’t see anyone having issues at 22 GB, and certainly not at 4 GB. I’m running this remotely on supercomputer servers and see the same failures whether I request 16 GB of RAM or 64 GB for mothur. I also run each step in a fresh instance of mothur, so any memory exhaustion is coming from the cluster step alone. It just seems absurd that mothur would trip up over a 4 GB distance matrix, but maybe that’s my limited experience talking. Have you had trouble with distance matrices of this size before?

Hello Pat,

The issue seems to have cleared up; we’re pretty sure it was a server problem, not a mothur issue. I’m sorry to have taken up your time, and thank you for the help! The pre.cluster step did help cut down the file sizes, and we’ll incorporate it when we have sequencing runs this large in the future.

Thank you very much!