Memory requirements for clustering 61GB distance file


I’m running average linkage clustering on our distance file (column @ 61GB, groups file @ 4,410,021 lines). I’ve specified a max memory usage on our cluster at 128GB and I’m still not able to read that entire file in, and so my process dies after about 3 hours accompanied by a :frowning: .

So, I’m trying to gauge how much memory I will need to process this file. I don’t really care how long the run takes at this point, I just want to get it to start. So:

1 - Can you estimate the memory required to read the distance file?
2 - Can you estimate the additional memory required for the clustering?

I guess cluster.split over MPI is my next option, but still have memory usage considerations similar to the above (i.e., how many nodes do I need to stay under the mem limit of some of the cluster nodes).


Can I assume a linear increase? If so, that’s ~150 hours and ~570 GB of RAM. Wait - I just noticed that these stats are from phylip matrices. Of course, my column matrix is much larger. So, I guess I’ll rerun it and get the phylip size and calc from there.


But I suggest you use column format which is much smaller than phylip format in size.

OK, so I’ve given up after days of waiting for any output from cluster() and moved to cluster.split(). I’m currently running the clustering on 50 CPUs across multiple cluster nodes with MPI.

1 - Is there a way to check the status of the splitting of the matrix? With GDB, I only get to the splitDistanceRAM() method, but not any actual progress. Is there a strategy that you use to check things like this?

2 - I can use the large=T param, but the splitting takes forever (on the order of 8 MB/min) so I’m looking at 122 hours or so just to split the matrix.

3 - Should I just shut up and wait for it to do its thing?


I guess I’m still confused why you have such a large distance matrix. We processed all of the HMP data (~40 million total reads) through the SOP and never saw anything that big. We certainly give you enough rope to do what you want, but if you aren’t following the protocol laid out in the SOP and you’re running into memory issues, then you might just be stuck. [sorry to seem harsh!]

Also, I wouldn’t split the distance file - I would run cluster.split by splitting the sequences by taxonomy. You’ll likely get something much more manageable and will probably do a better job with the memory.

Thanks, Pat. My data set is about 5M reads and, after following the SOP, is down to ~200K unique, aligned, screened, and filtered, pre-clustered reads. The matrix from 200K reads is the ~61GB in column format.

Are you slavishly following the SOP or just using the spirit of the SOP? Are you using minflows=450 and maxflows=450 in shhh.flows (the defaults)?

Also, you should use a cutoff if you aren’t - that’s the only reason your column matrix would be bigger than the phylip matrix.

Yes, I’m sticking pretty much to the SOP. I don’t have the flowgrams, but otherwise, it’s by the book. The commands are below. The group file name is non-standard b/c I created it on my own after removing homopolymers with trim.seqs (>6). This is also HMP data, V1-3, and it’s also processed by my other code that removes primers, barcodes, reads with N, and selects based on length range before even hitting mothur.


align.seqs(fasta=current, reference=silva.bacteria/silva.bacteria.fasta, processors=8)

screen.seqs(fasta=hmp_it_ct_vt.trim.unique.align, optimize=start-end, criteria=90, processors=8, name=hmp_it_ct_vt.trim.names,

filter.seqs(fasta=hmp_it_ct_vt.trim.unique.good.align, vertical=T, trump=., processors=8)

unique.seqs(fasta=hmp_it_ct_vt.trim.unique.good.filter.fasta, name=hmp_it_ct_vt.trim.good.names)

pre.cluster(fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.fasta, name=hmp_it_ct_vt.trim.unique.good.filter.names, diffs=1)

system(cp hmp_it_ct_vt.trim.unique.good.filter.unique.precluster.fasta hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.fasta)

system(cp hmp_it_ct_vt.trim.unique.good.filter.unique.precluster.names  hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.names)

chimera.uchime(fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.fasta, name=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.names, processors=8)

remove.seqs(accnos=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.uchime.accnos, fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.fasta, name=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.names,

classify.seqs(fasta=hmp_it_ct_vt.trim.unique.good.filter.unique.precluster1.pick.fasta, template=rdp/trainset6_032010.rdp.fasta, taxonomy=rdp/, cutoff=80, processors=8)

remove.lineage(fasta=current, name=current, group=current, taxonomy=current, taxon=Cyanobacteria)

<file copy...>

dist.seqs(fasta=final.fasta, processors=20, cutoff=0.15)

Ok - I’d run chop.seqs to keep the first 200 or 250 bp of each sequence (see Fig. 1 of You’re keeping around all the sequencing detritus that represents biodiversity generated via sequencing.