cluster.split hangs before merging list files

aschwartz · October 15, 2015, 9:54pm

I am running cluster.split on a large experiment with more than a thousand samples.
The exact command used is:

mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=5, cutoff=0.05)
Using v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.subsample.count_table as input file for the count parameter.
Using v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.fasta as input file for the fasta parameter.
Using v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.subsample.taxonomy as input file for the taxonomy parameter.

Using 64 processors.
Using splitmethod fasta.
...

Compared with previous runs with fewer samples, it appears that all the *.an.list files have been generated, but nothing else has happened since the last file was written 5 days ago. A single mothur thread is still running at 100% CPU utilization.

The problem appears to be due to a large number of unclassified Bacteria sequences, which are all split into the same group.
Is there a simple way to further split that group by distance before clustering?

Thanks,

Ariel

pschloss · October 19, 2015, 2:44pm

Hi Ariel,

The files may have been constructed, but I suspect there’s still one file that’s getting clustered. Also, I’m a bit worried that having clustered at 0.05, you’re not likely to get back clusters at 0.03. You probably want to set a higher threshold and try again. One thing you could try is to use taxlevel=6 (genus), although if your biggest group is at the kingdom level, this wouldn’t help (then again, I’d be surprised if your biggest group were at the kingdom level).

Pat

aschwartz · October 19, 2015, 6:01pm

Hi Pat,

Thank you for your reply. I am not sure I fully understand your comment about setting a higher threshold.
Shouldn’t I expect to get back clusters at all dissimilarity levels up to 0.05? Is this an artifact of using the Average neighbor method?
What is the recommended threshold to use if I want to ensure I get the clusters at 0.03?

Regarding the long running process, I think I found a working solution.
The main issue was that I had a large group of unclassified Bacteria.
I have downloaded the latest PDS training set (Version 14), and removed the bootstrap cutoff threshold in classify.seqs, which was originally set to 80.
Instead I will apply it only in classify.otu.

This version is still running, but I suspect it will finish much faster based on the size of the generated groups.
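
For anyone wanting to check the group sizes before committing to a multi-day run, here is a minimal Python sketch that works outside mothur. It assumes the usual two-column taxonomy file (sequence name, a tab, then a semicolon-separated lineage with optional confidence scores in parentheses) and simply counts unique sequences per taxon at a given taxlevel. The function names and the example lineages are hypothetical, and this only approximates how cluster.split would partition the data.

```python
import re
from collections import Counter


def taxon_at_level(lineage, level):
    """Truncate a lineage string to its first `level` ranks.

    Trailing confidence scores such as "(100)" are stripped from each rank.
    """
    ranks = [r for r in lineage.strip().split(";") if r]
    cleaned = [re.sub(r"\(\d+\)$", "", r) for r in ranks]
    return ";".join(cleaned[:level])


def group_sizes(lines, level):
    """Count unique sequences per taxon at `level` from taxonomy-file lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _name, lineage = line.split("\t", 1)
        counts[taxon_at_level(lineage, level)] += 1
    return counts


# Made-up example lines in the assumed format (not taken from the real dataset).
example = [
    "seq1\tBacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(98);Blautia(90);",
    "seq2\tBacteria(100);unclassified;unclassified;unclassified;unclassified;unclassified;",
    "seq3\tBacteria(97);unclassified;unclassified;unclassified;unclassified;unclassified;",
    "seq4\tBacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(85);unclassified;",
]

if __name__ == "__main__":
    # Largest groups first; one dominant group suggests the split will not help much.
    for taxon, n in group_sizes(example, level=5).most_common():
        print(n, taxon)
```

To run it on a real file, pass the open file handle instead of the example list, e.g. `group_sizes(open("my.taxonomy"), 5)`, where the file name is a placeholder. Note that this counts unique sequences only and ignores the abundances in the count table.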

Thanks,

Ariel

pschloss · October 29, 2015, 11:04am

You might look at this…

http://www.mothur.org/wiki/Frequently_asked_questions#Why_does_the_cutoff_change_when_I_cluster_with_average_neighbor.3F

I would suggest using a cutoff of 0.15 or 0.20.

aschwartz · November 25, 2015, 8:09pm

I am still struggling to cluster this dataset using cluster.split.

My latest attempt failed after several days with the following error:

[ERROR]: Could not open 77256.temp

I then tried the cluster command instead, using the concatenated distance matrix and the furthest method:

cluster(column=v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.fasta.dist, count=current, method=furthest, cutoff=0.03)

However, the resulting file included only the unique label.

Any suggestions for why the other labels up to 0.03 are not being calculated?

Thanks,

Ariel
