Problems when using cluster.split on huge .dist file

Hello there.
This is a continuation of the topic posted at:

which was closed due to inactivity.

I have a problem when trying to cluster eukaryotic sequences using either cluster or cluster.split.

After several tries in which the system crashed, I ran dist.seqs with the column output option and then tried to run cluster.split like this:

cluster.split(column=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist, count=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, cutoff=0.03, large=T)

That is, with all the requirements met to avoid using too much memory.
However, I got output like this:

Using 16 processors.
Splitting the file...
It took 529530 seconds to split the distance file.
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.33.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.209.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.115.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.451.temp
Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.39.temp
(...)
Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.807.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.658.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.724.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.1118.temp

Clustering Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist.1293.temp

(...)

tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score

(...)

0	1428050402	0	1	0	1428050402	0	1	0	1428050402	0	1	0
   1428050402	0	1	0	1428050402	0	1
tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score

(...)

And then the process stops.
Any idea what is happening? Could it be a disk-space problem? I am running mothur on a Windows server with 16 processors and 64 GB of RAM.

Thanks

Here are a few suggestions:

  1. Run cluster.split with fasta, count, and taxonomy files, and set cluster=f. https://mothur.org/wiki/Cluster.split#file

mothur > cluster.split(fasta=final.fasta, name=final.names, taxonomy=final.taxonomy, taxlevel=4, cluster=f, cutoff=0.03) - splits the files and creates the distance matrices. By running the splitting separately from the clustering, you can use more processors for the splitting and fewer for the clustering, which is much more memory intensive. You also eliminate the need to re-split if the clustering fails, which can be a major time saver.
mothur > cluster.split(file=final.file, processors=4) - you may need to run with processors=1 so that only one distance matrix is loaded into memory at a time.

  2. The command could be failing when it tries to calculate the final sensspec results. Try running with runsensspec=f. This will also stop mothur from creating the complete distance matrix, which can use a lot of hard-disk space.
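Applied to the command from the original post, that would look something like the following (assuming a mothur version recent enough to support the runsensspec parameter):

mothur > cluster.split(column=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.pick.dist, count=Eukarya_rocas.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, cutoff=0.03, large=T, runsensspec=f)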

  3. Use phylotype clustering. This may be required if your dataset is too large to cluster by distance.
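As a sketch of the phylotype approach (the file names here are placeholders, not files from this thread): phylotype bins sequences by their taxonomy instead of pairwise distances, so no distance matrix is needed, and make.shared builds the shared file at the chosen taxonomic level.

mothur > phylotype(taxonomy=final.taxonomy)
mothur > make.shared(list=final.tx.list, count=final.count_table, label=1)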

  4. Try running mothur on Amazon: https://mothur.org/wiki/Mothur_AMI


This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.