I have a large data set (234 samples). When I run cluster.split, the program writes temporary .dist (distance) files of 800 GB or more each, which causes it to run out of hard-drive space.
My questions:
Is this normal?
Is this the result of having too many sequences?
What are some strategies to reduce the file size (and presumably the run time)?
What kind of data do you have? You need more RAM than the size of your largest .dist file. You may need to increase diffs in pre.cluster, or taxlevel in cluster.split. I've clustered that many samples (V4 region, run on MiSeq 2x250, very diverse soils and waters) with diffs=3.
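To check the RAM rule of thumb above, here is a rough Python sketch that finds the largest .dist file in a directory and compares its size with the machine's physical memory. It is not part of mothur. The function names (largest_dist_file, physical_ram_bytes) are made up for illustration, it assumes the temporary files end in .dist and sit in the directory you point it at, and the RAM lookup relies on os.sysconf, which is only available on Linux and similar Unix systems.

```python
import os
from pathlib import Path


def largest_dist_file(directory):
    """Return (path, size_in_bytes) of the largest *.dist file, or None if none exist."""
    candidates = [p for p in Path(directory).glob("*.dist") if p.is_file()]
    if not candidates:
        return None
    biggest = max(candidates, key=lambda p: p.stat().st_size)
    return biggest, biggest.stat().st_size


def physical_ram_bytes():
    """Return total physical RAM in bytes, or None if the platform cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and some names are unsupported elsewhere.
        return None


if __name__ == "__main__":
    found = largest_dist_file(".")  # assumption: run from the mothur working directory
    ram = physical_ram_bytes()
    if found is None:
        print("No .dist files found here.")
    else:
        path, size = found
        print(f"Largest .dist file: {path} ({size / 1e9:.1f} GB)")
        if ram is None:
            print("Could not determine physical RAM on this platform.")
        elif size >= ram:
            print(f"That is at least as large as physical RAM ({ram / 1e9:.1f} GB).")
        else:
            print(f"That is smaller than physical RAM ({ram / 1e9:.1f} GB).")
```

If the largest file comes out bigger than your RAM, that points back to the suggestions above: a larger diffs or a deeper taxlevel, so that each split is smaller.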