Cluster.split I/O bottleneck?

I’m analyzing HiSeq V4 sequences following the MiSeq SOP.

After denoising and chimera removal, I had 8,016,675 total sequences and 2,666,065 unique sequences.

I’m currently running

cluster.split(fasta=fasta, count=count_table, taxonomy=taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.20, method=furthest, processors=24)

on a cluster where it has been running for more than a week.

However, the process is using less than 10% of a single core (it is running on a server with 64 cores), and it only appears to be using about 1 GB of RAM.

From top: ~1 GB of RAM and 7.6% of one CPU core:
31398 rbjork    20   0 1023m 995m 3116 D  7.6  0.2   8679:05 mothur

I don’t want to kill it as it’s still writing output, but the lack of CPU and RAM consumption makes me worry that it has hit a serious I/O bottleneck. (The D state in the top output means uninterruptible sleep, which is consistent with the process spending most of its time waiting on disk I/O.)

Running command: dist.seqs(fasta=all.trim.contigs.good.unique.good.filter.filter.unique.precluster.pick.pick.fasta.0.temp, processors=24, cutoff=0.2)
Using 24 processors.

The file

all.trim.contigs.good.unique.good.filter.filter.unique.precluster.pick.pick.fasta.0.temp

is currently 2.4 TB and growing.
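A back-of-the-envelope check (the fraction of pairs under the cutoff and the bytes per line below are guesses, not measurements): 2,666,065 unique sequences give 2,666,065 × 2,666,064 / 2 ≈ 3.55 × 10^12 pairwise distances. Even if only ~10% of those fall under the 0.20 cutoff and each line of the column-format distance file takes ~30 bytes (two sequence names plus a distance), that is roughly 3.6 × 10^11 lines × 30 bytes ≈ 10 TB, so a multi-terabyte .dist file is exactly what a matrix this size predicts.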

Sorry, I realize how well-worn this question is: http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/
I’ll try running

split.abund

in order to reduce the number of unique sequences.
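Something along these lines, I assume (a sketch using mothur's current-file shortcuts; cutoff=1 removes singletons, and the cutoff can be raised to cull more rare sequences):

split.abund(fasta=current, count=current, cutoff=1)

That should write separate rare and abund fasta/count files, and clustering only the abund fraction ought to shrink the matrix considerably.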

How much RAM can you access? I bet you need to drop your core count significantly for this step. Here’s what I do on a 16-core high-memory node (512 GB RAM):

#make OTUs for each Order individually; for very large datasets (hundreds of samples) you may need to split at a finer taxlevel, 5 or even 6. If you use 6 you will likely only get 3% OTUs, because the within-group differences aren't always 5%

cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=16, cluster=f)

#if your data is very diverse you may end up with .dist files that are too large for the number of processors: with 4 processors, your largest .dist file needs to fit in RAM 4 times over. On my system, I have to drop the processor count in this step if my largest .dist is over 64 GB
cluster.split(file=current, processors=4)
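The first call with cluster=f stops after writing the per-taxon distance files; the second call picks them up via file=current and does the actual clustering with fewer processors. Once clustering finishes, you'd presumably continue with something like this (a sketch, assuming a 0.03 cutoff is the label you want):

make.shared(list=current, count=current, label=0.03)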

Thanks, I’ll try that, though it may not be realistic to compute a ~2.6M × 2.6M distance matrix in the first place.

Using cluster.split(taxlevel=4) won’t make a 2.6M × 2.6M matrix unless all the seqs are in the same Order.
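If you want to check how lopsided the split will be before committing, something like this should show how many sequences fall into each taxon (a sketch; the size of the biggest Order determines your largest sub-matrix):

summary.tax(taxonomy=current, count=current)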

Out of curiosity, why are you using furthest-neighbor linkage?