cluster.split making TB of temp files

I have a large data set (234 samples), and when I run cluster.split, the program generates temporary distance files where each file takes up 800 GB or more, which is causing the program to run out of hard-drive space.

My questions:

  1. Is this normal?
  2. Is this the result of too many sequences involved?
  3. What are strategies to reduce the filesize (and presumably run time)?

cluster.split(fasta=current, name=current, taxonomy=current, splitmethod=classify)

What kind of data do you have? You need more RAM than the size of your largest .dist file. You may need to increase your pre.cluster diffs or the cluster.split taxlevel. I've clustered that many samples (v4 region on MiSeq 2x250, very diverse soils and waters) with diffs=3.
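The two suggestions above could be sketched roughly like this (a hedged example, not the poster's exact workflow; group/name files and taxlevel=4 are placeholder assumptions you'd adjust for your own dataset and reference taxonomy):

```
# Merge near-identical reads before distances are computed;
# diffs=3 allows up to 3 mismatches, shrinking the number of
# unique sequences and therefore the .dist files (assumption: you
# have name and group files set as current)
pre.cluster(fasta=current, name=current, group=current, diffs=3)

# Split on a finer taxonomic rank so each temporary distance
# file covers a smaller group of sequences (taxlevel=4 is an
# illustrative value, not a recommendation)
cluster.split(fasta=current, name=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.03)
```

The trade-off is that a higher diffs value merges more sequences (losing some resolution), while a higher taxlevel produces more, smaller distance files that fit in memory and on disk.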

It is MiSeq 2x250.

After pre-clustering at diffs=2, there were >1 million unique sequences. I am running again with diffs=3 to see what happens.


What region did you sequence?