I’m trying to run sens.spec on a couple of different clustering methods for a large-ish dataset (2.5M reads, which after following the SOP yields 300k uniques). This is 1/3 of my total dataset, so I’d like to determine how well a greedy clustering algorithm (CrunchClust) does compared to average neighbor clustering. (FWIW, I’ve read your paper on this and believe you when you say average neighbor is better. But I’m pushing the limits of what I can do computationally with this dataset, and I need to determine whether the improvement is significant enough to justify the massive increase in time and computing power over CrunchClust.) My column-format dist file is 18 GB. sens.spec will run for approximately an hour and then kill itself, I think because of RAM limitations (I’m running on one processor with 24 GB physical RAM and 80 GB swap). Is there a way to do this on such a large file? I have access to a cluster computing facility with up to 256 GB RAM per node, but I haven’t used it yet because I don’t know how to estimate how much RAM and processing time I’ll need for this step. So I guess I also don’t know whether the crash is a bug (I’m not sure why 104 GB of RAM + virtual memory isn’t enough for an 18 GB distance matrix). Thanks
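For what it's worth, here is the back-of-envelope arithmetic I'd use to size this. It assumes only that a column-format distance file has one "nameA nameB dist" triple per line (~40 bytes on average) and picks a hypothetical ~24 bytes per pair for the in-memory representation; the real cost inside mothur could easily be several times higher because of per-sequence map overhead, so treat this as a lower bound, not a promise:

```python
# Rough sizing sketch for a column-format distance file.
# Assumptions (mine, not mothur's documented internals):
#   - ~40 bytes per "nameA nameB dist" line on disk
#   - ~24 bytes per pair held in memory (two indices + a float + overhead)

def estimate_pairs(file_size_bytes, avg_line_bytes=40):
    """Rough count of pairwise distances in a column dist file."""
    return file_size_bytes // avg_line_bytes

def estimate_ram_gb(n_pairs, bytes_per_pair=24):
    """Rough lower bound on RAM needed to hold all pairs at once."""
    return n_pairs * bytes_per_pair / 1e9

pairs = estimate_pairs(18 * 10**9)  # the 18 GB file above
print(f"~{pairs / 1e6:.0f}M pairs, >= ~{estimate_ram_gb(pairs):.0f} GB RAM")
```

By this estimate an 18 GB file holds on the order of 450M distances, which should nominally fit in 24 GB, so if the process still dies the overhead factor (or something else entirely) is the question.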
Could you be accidentally passing a column-formatted distance file via the phylip parameter?
I was using the correct distance format (it won’t run if you specify the wrong one), but I did get it to work by using a dereplicated list, so my list file only has the 300k unique/pre.cluster sequences rather than the full 1.9M. The numbers are somewhat off this way: does one TP really mean one, or does that sequence represent 10k other sequences? But oh well, it’s good enough to tell roughly how CrunchClust did compared to aligned, complete clustering.
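The distortion you describe can be made concrete. This is only a toy sketch of the counting issue, under the assumption that sens.spec scores sequence *pairs*: one TP between two uniques of abundance a and b would, on the redundant data, correspond to a*b cross pairs plus the pairs among the identical reads themselves (which sit at distance 0 in the same OTU). None of the function names here are from mothur:

```python
from math import comb

def redundant_tp_pairs(a, b):
    """TP pairs implied by one dereplicated TP between two unique
    sequences of abundance a and b (hypothetical illustration):
    a*b cross pairs, plus pairs among each set of identical reads."""
    return a * b + comb(a, 2) + comb(b, 2)

# Singletons: dereplicated and redundant counts agree.
print(redundant_tp_pairs(1, 1))       # 1 pair

# One very abundant unique (like the 10k case above) dominates:
# its internal pairs alone dwarf everything else.
print(redundant_tp_pairs(10_000, 1))
```

So dereplicated sensitivity/specificity weight every unique equally, while abundance-weighted counts would be dominated by the most abundant uniques; which of the two is the fairer comparison depends on what you want the metric to reflect.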