Cluster.split I/O bottleneck?

I’m analyzing HiSeq V4 sequences following the MiSeq SOP.

After denoising and chimera removal, I had about 8016675 seqs and 2666065 unique seqs.

I’m currently running

cluster.split(fasta=fasta, count=count_table, taxonomy=taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.20, method=furthest, processors=24)

on a cluster where it has been running for more than a week.

However, the process is using less than 10% of a single core (it is running on a server with 64 cores). It also only appears to utilize 1GB of RAM.

From top: ~1GB of RAM and 7.6% of one CPU core
31398 rbjork    20   0 1023m 995m 3116 D  7.6  0.2   8679:05 mothur

I don’t want to kill it since it’s still writing output, but the low CPU and RAM usage makes me worry that it has hit a serious I/O bottleneck.
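A note on reading that top line: the `D` in the state column means uninterruptible sleep, which on Linux usually means the process is waiting on I/O. Below is a minimal Python sketch of that check. It assumes top's default column order, and the helper names and the 20% CPU threshold are made up for illustration. A single top sample only shows the state at that instant, so this is a hint rather than proof.

```python
# Column order assumes top's default layout:
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
TOP_FIELDS = ["pid", "user", "pr", "ni", "virt", "res", "shr",
              "state", "cpu_pct", "mem_pct", "time", "command"]


def parse_top_line(line):
    """Split one process row from top into a dict keyed by column name."""
    parts = line.split(None, len(TOP_FIELDS) - 1)
    if len(parts) != len(TOP_FIELDS):
        raise ValueError("unexpected number of columns in top line")
    row = dict(zip(TOP_FIELDS, parts))
    row["cpu_pct"] = float(row["cpu_pct"])
    row["mem_pct"] = float(row["mem_pct"])
    return row


def looks_io_bound(row, cpu_threshold=20.0):
    """Hypothetical heuristic: state D (uninterruptible sleep) plus low CPU use."""
    return row["state"] == "D" and row["cpu_pct"] < cpu_threshold


row = parse_top_line(
    "31398 rbjork    20   0 1023m 995m 3116 D  7.6  0.2   8679:05 mothur"
)
print(row["state"], row["cpu_pct"], looks_io_bound(row))
```

Sampling the state several times over a few minutes would give a more reliable picture than one snapshot.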

Running command: dist.seqs(fasta=all.trim.contigs.good.unique.good.filter.filter.unique.precluster.pick.pick.fasta.0.temp, processors=24, cutoff=0.2)
Using 24 processors.

The file


is currently 2.4 TB and growing.

Sorry, I realize this question has been asked many times before.
I’ll try running


in order to reduce the number of unique sequences.

How much RAM can you access? I bet you need to drop your cores significantly for this step. Here’s what I do on a 16-core high-memory node (512 GB RAM):

#Make OTUs for each order individually. For very large datasets (hundreds of samples) you may need to split at a lower rank by setting taxlevel to 5 or even 6. If you use 6 you will likely only get 3% OTUs, because the within-group differences aren't always 5%.

cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=16, cluster=f)

#If your data is very diverse, you may end up with .dist files that are too large for the number of processors: with 4 processors, your largest .dist file needs to fit in RAM 4 times. On my system, I have to drop the processors in this step if my largest .dist is over 64 GB.
cluster.split(file=current, processors=4)
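The rule of thumb above (the largest .dist file has to fit in RAM once per processor) can be turned into a quick estimate. This is only a rough Python sketch: the function name is made up, and it ignores whatever memory mothur needs beyond the matrix itself, so treat the result as an upper bound and leave some headroom.

```python
def max_processors(ram_gb, largest_dist_gb, requested):
    """Largest processor count for which `processors` copies of the
    biggest .dist file still fit in RAM (never fewer than 1)."""
    if largest_dist_gb <= 0:
        raise ValueError("largest_dist_gb must be positive")
    fit = int(ram_gb // largest_dist_gb)
    return max(1, min(requested, fit))


# 512 GB node, 16 processors requested, a few example .dist sizes in GB
for dist_gb in (20, 64, 128, 600):
    print(dist_gb, max_processors(512, dist_gb, 16))
```

The result would then go into `processors=` for the second `cluster.split` call.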

Thanks, I’ll try that, though it might not be realistic to calculate a ~2.6M x 2.6M distance matrix in the first place.

Using cluster.split(taxlevel=4) won’t make a 2M x 2M matrix unless all the seqs are in the same order.
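To put rough numbers on that: the number of pairwise comparisons grows with n(n-1)/2, so splitting by taxonomy shrinks the work a lot. Here is a small Python sketch of the arithmetic. The even 10-way split is purely hypothetical, since real orders will be very uneven in size, and the function names are made up.

```python
def pair_count(n):
    """Number of unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2


def split_pair_count(group_sizes):
    """Total pairs when distances are only computed within each group."""
    return sum(pair_count(n) for n in group_sizes)


n_unique = 2666065  # unique seqs reported in the original post

# Hypothetical even split into 10 groups, just to show the scaling
groups = [n_unique // 10] * 10

print("unsplit pairs:", pair_count(n_unique))
print("split pairs:  ", split_pair_count(groups))
```

With an even split into k groups the pair count drops by roughly a factor of k. If one order holds most of the sequences, the saving is much smaller.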

Out of curiosity, why are you using furthest neighbor linkage?