Memory requirements for clustering 61GB distance file

cfriedline · December 8, 2011, 7:59pm

Hi,

I’m running average linkage clustering on our distance file (column @ 61GB, groups file @ 4,410,021 lines). I’ve specified a max memory usage on our cluster at 128GB and I’m still not able to read that entire file in, and so my process dies after about 3 hours accompanied by a .

So, I’m trying to gauge how much memory I will need to process this file. I don’t really care how long the run takes at this point, I just want to get it to start. So:

1 - Can you estimate the memory required to read the distance file?
2 - Can you estimate the additional memory required for the clustering?

I guess cluster.split over MPI is my next option, but still have memory usage considerations similar to the above (i.e., how many nodes do I need to stay under the mem limit of some of the cluster nodes).

Thanks,
Chris

westcott · December 9, 2011, 1:33pm

http://www.mothur.org/wiki/Cluster_stats

cfriedline · December 9, 2011, 2:02pm

Can I assume a linear increase? If so, that’s ~150 hours and ~570 GB of RAM. Wait - I just noticed that these stats are from phylip matrices. Of course, my column matrix is much larger. So, I guess I’ll rerun it and get the phylip size and calc from there.

Thanks.

shuixia100 · December 10, 2011, 8:10am

But I suggest you use column format which is much smaller than phylip format in size.

cfriedline · December 13, 2011, 3:50pm

OK, so I’ve given up after days of waiting for any output from cluster() and moved to cluster.split(). I’m currently running the clustering on 50 CPUs across multiple cluster nodes with MPI.

1 - Is there a way to check the status of the splitting of the matrix? With GDB, I only get to the splitDistanceRAM() method, but not any actual progress. Is there a strategy that you use to check things like this?

2 - I can use the large=T param, but the splitting takes forever (on the order of 8 MB/min) so I’m looking at 122 hours or so just to split the matrix.

3 - Should I just shut up and wait for it to do its thing?

Thanks,
Chris

pschloss · December 14, 2011, 9:23pm

I guess I’m still confused why you have such a large distance matrix. We processed all of the HMP data (~40 million total reads) through the SOP and never saw anything that big. We certainly give you enough rope to do what you want, but if you aren’t following the protocol laid out in the SOP and you’re running into memory issues, then you might just be stuck. [sorry to seem harsh!]

Also, I wouldn’t split the distance file - I would run cluster.split by splitting the sequences by taxonomy. You’ll likely get something much more manageable and will probably do a better job with the memory.

cfriedline · December 15, 2011, 5:24am

Thanks, Pat. My data set is about 5M reads and, after following the SOP, is down to ~200K unique, aligned, screened, and filtered, pre-clustered reads. The matrix from 200K reads is the ~61GB in column format.

pschloss · December 15, 2011, 1:04pm

Are you slavishly following the SOP or just using the spirit of the SOP? Are you using minflows=450 and maxflows=450 in shhh.flows (the defaults)?

Also, you should use a cutoff if you aren’t - that’s the only reason your column matrix would be bigger than the phylip matrix.

cfriedline · December 15, 2011, 4:29pm

Yes, I’m sticking pretty much to the SOP. I don’t have the flowgrams, but otherwise, it’s by the book. The commands are below. The group file name is non-standard b/c I created it on my own after removing homopolymers with trim.seqs (>6). This is also HMP data, V1-3, and it’s also processed by my other code that removes primers, barcodes, reads with N, and selects based on length range before even hitting mothur.

unique.seqs(fasta=hmp_it_ct_vt.trim.fasta)

align.seqs(fasta=current, reference=silva.bacteria/silva.bacteria.fasta, processors=8)

screen.seqs(fasta=hmp_it_ct_vt.trim.unique.align, optimize=start-end, criteria=90, processors=8, name=hmp_it_ct_vt.trim.names, group=hmp_it_ct_vt.group)

filter.seqs(fasta=hmp_it_ct_vt.trim.unique.good.align, vertical=T, trump=., processors=8)

unique.seqs(fasta=hmp_it_ct_vt.trim.unique.good.filter.fasta, name=hmp_it_ct_vt.trim.good.names)

pre.cluster(fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.fasta, name=hmp_it_ct_vt.trim.unique.good.filter.names, diffs=1)

system(cp hmp_it_ct_vt.trim.unique.good.filter.unique.precluster.fasta hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.fasta)

system(cp hmp_it_ct_vt.trim.unique.good.filter.unique.precluster.names  hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.names)

chimera.uchime(fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.fasta, name=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.names, processors=8)

remove.seqs(accnos=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.uchime.accnos, fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.fasta, name=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.names, group=hmp_it_ct_vt.trim.fasta.good.group)

classify.seqs(fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.pick.fasta, template=rdp/trainset6_032010.rdp.fasta, taxonomy=rdp/trainset6_032010.rdp.tax, cutoff=80, processors=8)

remove.lineage(fasta=current, name=current, group=current, taxonomy=current, taxon=Cyanobacteria)

<file copy...>

dist.seqs(fasta=final.fasta, processors=20, cutoff=0.15)

pschloss · December 15, 2011, 4:51pm

Ok - I’d run chop.seqs to keep the first 200 or 250 bp of each sequence (see Fig. 1 of http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027310). You’re keeping around all the sequencing detritus that represents biodiversity generated via sequencing.

Topic		Replies	Views
RAM needed for cluster.split	6	990	November 30, 2021
Clustering a 10GB distance matrix mothur bugs	2	4197	March 16, 2011
Command cluster_issue	17	888	November 28, 2021
cluster.split() and MPI memory usage bug? mothur bugs	0	4277	December 13, 2011
Cluster.split and computer characteristics	7	1847	October 23, 2019

Memory requirements for clustering 61GB distance file

Related topics