Pre.Cluster uses too much memory, time

Hi All,

I think I’m not the only one to have this problem, but I’m wondering if there are any solutions:

I have 15M MiSeq reads of the V3V4 region and I’ve gotten to the pre.cluster step, and it’s going too slowly to be useful. At first I tried it with 6 processors, but then aborted it when it got to be using >85% of my memory. I retired with 4 and it was the same thing. Yesterday afternoon I started it again with just 2 and it’s been 20 hours without any sign of it being done.

Does running it with just 1 processor fix this at all, or am I just going to wait another 20 hours to find out it isn’t working?

I have 8gb of memory and task manager tells me I’m using 18gb of virtual address space, which makes me think that even if running it on fewer processes helps, it won’t be enough.

Does anyone have any other way around this? Does running it on something like an Amazon web server work? Do I have to just wait for it to finish? …Is QIIME any faster? I really can’t afford to have this take several days, my thesis is due in almost a week and I need this data as soon as possible.

Thank you for the help!

I tried it on an Amazon web server with 160 GB Memory using 23 processors:

mothur >
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=4)

Using 23 processors.

Processing group JW-0002:
Error in reading your fastafile, at position -1. Blank name.

I got the error underlined above. People have said that this happens when a process hits the system limit on memory, but I was at about 155 of my 160. The one thing is that the virtual hard drive on the AWS machine is very small (only 4Gb of free space). Could it be that it’s trying to write files to disk that are larger than that?

Why else might I be getting this error? I’m retrying it with only 16 processors, I’ll see if that works.

Another update:

With 16 processors it ran fine without error, but it still hasn’t finished. It has been about 2 hours since it’s given any new output in the terminal, written any files, or had any CPU usage. Is this because there is only 3 gb free? Does it need more to output? I deleted what I could but the server I got didn’t have much disk space to begin with.

Is it normal for it to seemingly do nothing for 2 hours, or is it never going to finish?


I’m sorry you’re running into problems with pre-cluster. The problem stems from the high error rate you are experiencing by not having your reads fully overlap. Even if you were able to get through pre-cluster, you’d struggle with chimera.uchime and clustering. The alternative would be to classify your sequences and then use phylotype. Although this doesn’t help you with your deadline, I would normally suggest regenerating V4 data so that the reads fully overlap so that you can use the OTU-based approach. Here’s a more extensive discussion of this issue…


Thanks for this.

I didn’t really have much control over how the sequencing was done, but I get that doing V3V4 wasn’t ideal.

I did subsample my data though, and it made it all the way through. However, a huge amount of my OTUs are only 1bp different from another OTU. If I preclustered with diffs=4 shouldn’t this not have been the case?

I also clustered with a 0.03 cutoff, shouldn’t that have grouped them better.

I ran:

mothur >
pre.cluster(fasta=sub.trim.contigs.good.unique.good.filter.unique.fasta, count=sub.trim.contigs.good.unique.good.filter.count_table, diffs=4)

Using 16 processors.

mothur> cluster.split(fasta=current, count=current, taxonomy=classify.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=16)

mothur > make.shared(list=current, count=current, label=0.03)
Using classify.count_table as input file for the count parameter.
Using as input file for the list parameter.

Output File Names:

mothur > classify.otu(list=current, count=current, taxonomy=current, label=0.03) Using classify.count_table as input file for the count parameter. Using as input file for the list parameter. Using classify.taxonomy as input file for the taxonomy parameter.

But, for example, here are my 2 most abundant OTUs, blasted against each other. They are >99% identical! Why??

Your link didn’t work, but I suspect your blast results trimmed the tails of the sequences where there were differences between the sequences. Look at the start/end positions on the blast alignment output. This is why we don’t encourage the use of blast.


That is definitely not the case. Below are representative sequences of the top 2 OTUs from clustering at 0.03:

It’s hard to read in this format, but copy and paste it into a text editor and you see that they are exactly identical except for 1 bp.

M00700_224_000000000-AN62B_1_1101_13457_2556 Otu000001|1
M00700_224_000000000-AN62B_1_1101_19498_2170 Otu000002|1

Hmmm. I worry that something weird happened when you changed file names and perhaps used the wrong file.

Can you run…

grep -A 1 “M00700_224_000000000-AN62B_1_1101_13457_2556” sub.trim.contigs.good.unique.good.filter.unique.precluster.fasta
grep -A 1 “M00700_224_000000000-AN62B_1_1101_19498_2170” sub.trim.contigs.good.unique.good.filter.unique.precluster.fasta

… and tell us what you get?

This is a common issue in our lab. We often have jobs killed for using too much memory given the number of processors. It is a bit of trial and error, but a ball park estimate is around 1 processor per 24-32 gb allocation e.g 4 processors with 96 gb.

Alternatively, use split.abund before hand and remove singletons. As Pat suggests, and I agree, poor overlap in contigs creates real headaches!