Problems when using cluster.split on huge .dist file

Hello there.
This is a continuation of the topic posted at:

which was closed due to inactivity.

I have a problem when trying to cluster eukaryotic sequences using either cluster or cluster.split.

After several tries in which the system crashed, I ran dist.seqs with the column output option and then tried to run cluster.split like this:

cluster.split(column=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist, count=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, cutoff=0.03, large=T)

That is, with all the requirements met to avoid using too much memory.
However, I got output like this:

Using 16 processors.
Splitting the file...
It took 529530 seconds to split the distance file.
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.33.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.209.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.115.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.451.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.39.temp
(...)
Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.807.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.658.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.724.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.1118.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.1293.temp

(...)

tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score

(...)

0	1428050402	0	1	0	1428050402	0	1	0	1428050402	0	1	0
   1428050402	0	1	0	1428050402	0	1
tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score

(...)

And then the process stops.
Any idea what is happening? Could it be a disk-space problem? I am running mothur on a Windows server with 16 processors and 64 GB of RAM.

Thanks

Here are a few suggestions:

  1. Run cluster.split with fasta, count, and taxonomy files, and set cluster=f. https://mothur.org/wiki/Cluster.split#file

mothur > cluster.split(fasta=final.fasta, name=final.names, taxonomy=final.taxonomy, taxlevel=4, cluster=f, cutoff=0.03) - splits the files and creates the distance matrices. By running the splitting separately from the clustering, you can use more processors for the splitting and fewer for the clustering, which is much more memory intensive. You also eliminate the need to re-split if the clustering fails, which can be a major time saver.
mothur > cluster.split(file=final.file, processors=4) - you may need to run with processors=1 so that only one distance matrix is loaded into memory at a time.

  2. The command could be failing when it tries to calculate the final sensspec results. Try running with runsensspec=f. This will also stop mothur from creating the complete distance matrix, which can use a lot of hard-disk space.
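Applied to the command from the original post, that would look something like the following (assuming a mothur version recent enough to support the runsensspec parameter):

mothur > cluster.split(column=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist, count=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, cutoff=0.03, large=T, runsensspec=f)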

  3. Use phylotype clustering. This may be required if your dataset is too large to cluster by distance.
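As a sketch of the phylotype approach (the file names here are placeholders, not files from this thread): phylotype bins sequences by their taxonomy instead of pairwise distances, so no distance matrix is needed, and make.shared builds the shared file at the chosen taxonomic level.

mothur > phylotype(taxonomy=final.taxonomy)
mothur > make.shared(list=final.tx.list, count=final.count_table, label=1)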

  4. Try running mothur on Amazon: https://mothur.org/wiki/Mothur_AMI


This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.