I’m analyzing HiSeq V4 sequences following the MiSeq SOP.
After denoising and chimera removal I had ~8,016,675 total sequences and 2,666,065 unique sequences.
I’m currently running
cluster.split(fasta=fasta, count=count_table, taxonomy=taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.20, method=furthest, processors=24)
on a cluster where it has been running for more than a week.
However, the process is using less than 10% of a single core (it is running on a server with 64 cores) and only about 1 GB of RAM.
From top: ~1 GB of RAM and 7.6% of one CPU core
31398 rbjork 20 0 1023m 995m 3116 D 7.6 0.2 8679:05 mothur
I don’t want to kill it, as it’s still writing output, but the low CPU and RAM usage makes me worry it has hit a serious I/O bottleneck.
Running command: dist.seqs(fasta=all.trim.contigs.good.unique.good.filter.filter.unique.precluster.pick.pick.fasta.0.temp, processors=24, cutoff=0.2)
Using 24 processors.
The resulting .dist file is currently 2.4 TB and growing.
Sorry, I realize how tired this question is: http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/
I’ll try running
in order to reduce the number of unique sequences.
How much RAM can you access? I bet you need to drop your core count significantly for this step. Here’s what I do on a 16-core high-mem node (512 GB RAM):
#make OTUs for each Order individually; for very large datasets (hundreds of samples) you may need to increase the taxlevel to 5 or even 6. If you use 6 you will likely only get 3% OTUs, because the within-group differences aren't always 5%
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=16, cluster=f)
#if your data is very diverse you may end up with .dist files that are too large for the number of processors (with 4 processors, your largest .dist file needs to fit in RAM 4 times over; on my system I have to drop the processor count for this step if my largest .dist is over 64 GB)
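The rule of thumb above can be sketched as a quick calculation. The file size and node RAM below are hypothetical example values (the 512 GB figure comes from the node described above), not measurements from this run:

```python
# Rule-of-thumb check: each processor needs room for the largest split
# .dist file in RAM, so cap the processor count at total RAM divided by
# the largest file. Both input values here are assumed examples.
largest_dist_gb = 64   # size of the biggest per-Order .dist file
total_ram_gb = 512     # high-mem node from the post above

max_procs = total_ram_gb // largest_dist_gb
print(f"use at most {max_procs} processors")
```

With these numbers the cap works out to 8 processors, well below the 24 used in the original run.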
Thanks, I’ll try that. Though it might not be realistic to try to calculate a ~2.6M × 2.6M distance matrix in the first place.
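The worry about a full 2.6M × 2.6M matrix can be put in rough numbers. The bytes-per-line figure below is an assumption for a column-format .dist file (two sequence names plus a distance), and in practice the cutoff discards most pairs, so the real file is far smaller:

```python
# Back-of-envelope size of an uncut pairwise distance matrix for the
# ~2.67M unique sequences mentioned in the thread. The 40 bytes/line
# is a hypothetical average, not a measured value.
n_unique = 2_666_065
pairs = n_unique * (n_unique - 1) // 2   # unique pairwise comparisons
bytes_per_line = 40

size_tb = pairs * bytes_per_line / 1e12
print(f"{pairs:.3e} pairs, roughly {size_tb:.0f} TB uncut")
```

That comes out on the order of 10^12 pairs and well over 100 TB, which is why splitting by taxonomy (and reducing unique sequences) matters so much here.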
Using cluster.split(taxlevel=4) won’t produce a 2.6M × 2.6M matrix unless all the sequences fall in the same Order.
Out of curiosity, why are you using furthest-neighbor linkage?