I am running cluster.split on a large experiment with more than a thousand samples.
The exact command used is:
mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=5, cutoff=0.05)
Using v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.subsample.count_table as input file for the count parameter.
Using v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.fasta as input file for the fasta parameter.
Using v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.subsample.taxonomy as input file for the taxonomy parameter.
Using 64 processors.
Using splitmethod fasta.
Compared with previous runs with fewer samples, it appears that all the *.an.list files have been generated, but nothing else has happened since the last file was written 5 days ago. A single mothur thread is still running at 100% CPU utilization.
The problem appears to be due to a large number of unclassified Bacteria sequences, which are all split into the same group.
Is there a simple way to further split that group by distance before clustering?
The files may have been constructed, but I suspect there’s still one file that’s getting clustered. Also, I’m a bit worried that, having clustered at 0.05, you’re not likely to get back clusters at 0.03. You probably want to set a higher threshold and try again. One thing you could try is taxlevel=6 (genus), although if your biggest group is at the kingdom level, this wouldn’t help (then again, I’d be surprised if your biggest group were at the kingdom level).
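To see which group is actually the largest at a given taxlevel, one option is to tally the taxonomy file directly. The following is a minimal Python sketch, not part of mothur. It assumes the usual taxonomy layout (sequence name, whitespace, then a semicolon-separated lineage with bootstrap values in parentheses). The function name group_sizes and the example records are made up for illustration. Note that it counts unique sequences (one per line), not abundances from the count file.

```python
import re
from collections import Counter


def group_sizes(taxonomy_lines, taxlevel):
    """Count sequences per group after truncating each lineage at taxlevel."""
    counts = Counter()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue
        # First field is the sequence name, the rest is the lineage.
        _name, lineage = line.split(None, 1)
        # Drop bootstrap values such as "(100)" and the trailing semicolon.
        ranks = [re.sub(r"\(\d+\)$", "", rank)
                 for rank in lineage.strip(";").split(";")]
        counts[";".join(ranks[:taxlevel])] += 1
    return counts


# Hypothetical records, only to show the expected input shape.
example = [
    "seq1\tBacteria(100);Firmicutes(99);Clostridia(98);",
    "seq2\tBacteria(100);Bacteria_unclassified(100);Bacteria_unclassified(100);",
    "seq3\tBacteria(100);Bacteria_unclassified(100);Bacteria_unclassified(100);",
]

for group, n in group_sizes(example, taxlevel=2).most_common():
    print(n, group)
```

With a real file, the lines could be passed in with open(path) in place of the example list. Comparing the tallies for taxlevel=5 and taxlevel=6 would show whether going one level deeper breaks up the largest group at all.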
Thank you for your reply. I am not sure I fully understand your comment about setting a higher threshold.
Shouldn’t I expect to get back clusters at all dissimilarity levels up to 0.05? Is this an artifact of using the Average neighbor method?
What is the recommended threshold to use if I want to ensure I get the clusters at 0.03?
Regarding the long running process, I think I found a working solution.
The main issue was that I had a large group of unclassified Bacteria.
I downloaded the latest PDS training set (version 14) and removed the bootstrap cutoff in classify.seqs, which was originally set to 80.
Instead I will apply it only in classify.otu.
This run is still in progress, but I suspect it will finish much faster, judging by the size of the generated groups.
I am still struggling to cluster this dataset using cluster.split.
My latest attempt failed after several days with the following error:
[ERROR]: Could not open 77256.temp
I then tried the cluster command instead, with the concatenated distance matrix and the furthest neighbor method:
cluster(column=v4_MiSeq.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.fasta.dist, count=current, method=furthest, cutoff=0.03)
However, the resulting file included only the unique label.
Any suggestions for why the other labels up to 0.03 are not being calculated?
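One thing that may be worth checking first is how many pairs in the column-format distance file fall at or below the requested cutoff. The following is a rough Python sketch, assuming each line holds two sequence names and a distance separated by whitespace. The function name count_within_cutoff and the example pairs are hypothetical. It only summarizes the input and does not by itself explain the missing labels.

```python
def count_within_cutoff(dist_lines, cutoff):
    """Return (pairs at or below cutoff, total pairs) for column-format lines."""
    total = 0
    within = 0
    for line in dist_lines:
        parts = line.split()
        if len(parts) != 3:
            # Skip blank or malformed lines instead of failing.
            continue
        total += 1
        if float(parts[2]) <= cutoff:
            within += 1
    return within, total


# Hypothetical pairs, only to show the expected input shape.
example = [
    "seqA seqB 0.01",
    "seqA seqC 0.04",
    "seqB seqC 0.03",
]

within, total = count_within_cutoff(example, cutoff=0.03)
print(f"{within} of {total} pairs are at or below the cutoff")
```

For a real file, the lines can be streamed with open(path) so that the whole matrix never has to be held in memory.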