Tree building on a very large data set

Hi all,

I have a large data set of about 140,000 sequences after post-production removal of chimeras and lower-quality sequences. I’m able to build a distance matrix, though it is 63 GB. I’m trying to build a tree using clearcut through mothur, but I suspect it has frozen my computer. After a day of running, it seemed to have maxed out my memory (32 GB of RAM plus 9 GB of swap, all used). It has now been running for over 2 weeks. Currently it uses about 10% of one processor (out of 8), and the mothur prompt is still blinking. It hasn’t given me any error messages, but I think it’s frozen.
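For a rough sense of why memory fills up, here is a back-of-the-envelope estimate of how much RAM a pairwise distance matrix for 140,000 sequences would need. This is only a sketch: the 4-byte value size and the choice between lower-triangle and full-square storage are assumptions for illustration, not a statement of how clearcut actually stores its matrix.

```python
def matrix_memory_gb(n_seqs, bytes_per_value=4, full_square=False):
    """Estimate the in-memory size of a pairwise distance matrix in GB (10**9 bytes).

    bytes_per_value=4 assumes single-precision floats (an assumption,
    not a documented property of clearcut).
    """
    if full_square:
        n_values = n_seqs * n_seqs
    else:
        # Lower triangle only: one value per unordered pair of sequences.
        n_values = n_seqs * (n_seqs - 1) // 2
    return n_values * bytes_per_value / 1e9


if __name__ == "__main__":
    n = 140_000
    print(f"lower triangle: {matrix_memory_gb(n):.1f} GB")                  # 39.2 GB
    print(f"full square:    {matrix_memory_gb(n, full_square=True):.1f} GB")  # 78.4 GB
```

Under these assumptions even the lower triangle alone (about 39 GB) comes close to the 41 GB of RAM plus swap available, which would be consistent with the heavy swapping and low CPU use described above.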

I’m wondering if there is an alternate way to analyze this data set and build a tree, or am I just out of luck since I don’t have access to a computer with more resources? Is there a way to have clearcut use hard drive space instead of memory, like the hcluster command does when building a distance matrix? Is there another tree-building program anyone can recommend that might be able to handle a data set this large? I used FastTree a year ago, but I haven’t used it since I switched to a Linux machine, because clearcut is supported in mothur. Any suggestions or advice would be appreciated. Please let me know if you need further information. Thanks!

-Damon

Damon,

Are you only using unique sequences, or are you including the redundant sequences as well? We aren’t too likely to muck with the clearcut code to do an hcluster-type workaround, and I’m not sure it would be too practical since it would be wicked slow. You might consider trying another tree-building program. Sorry…

I’d use FastTree outside of mothur. Turn off bootstraps with “-noboot” though. mothur should read the resulting tree fine.
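A minimal sketch of what running it outside of mothur could look like, driven from Python. Assumptions to check against your own install: the executable name (it may be `FastTree`, `fasttree`, or `FastTreeMP` depending on how it was installed), the “-noboot” flag (taken as-is from the suggestion above; confirm it with `FastTree -help`), and the file names, which are hypothetical. Note that FastTree takes the aligned FASTA file as input, not the distance matrix, and writes a Newick tree to standard output.

```python
import shutil
import subprocess


def fasttree_command(alignment_path, no_bootstrap_flag="-noboot", executable="FastTree"):
    """Build the argument list for a nucleotide FastTree run.

    "-nt" tells FastTree the alignment is nucleotide rather than protein.
    no_bootstrap_flag defaults to the flag named in the post above; verify
    it against the help output of your FastTree version.
    """
    return [executable, "-nt", no_bootstrap_flag, alignment_path]


def run_fasttree(alignment_path, tree_path, **kwargs):
    """Run FastTree and write its Newick output (stdout) to tree_path."""
    cmd = fasttree_command(alignment_path, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    with open(tree_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file name; prints the command without running it.
    print(" ".join(fasttree_command("final.filter.fasta")))
```

Calling `run_fasttree("final.filter.fasta", "final.tre")` would then produce a tree file that can be handed back to mothur.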

J

Hi Pat,

I’ve already reduced my samples to unique sequences and removed putative chimeras. I’ve also done a precluster. I think that’s the smallest data set I can get. I could use the trump option during filtering, but I found that it reduces my diversity numbers significantly.

I’ll look into using FastTree for this data set. Thanks J!

If anyone else can recommend other tree-building programs, I’d appreciate any suggestions. Thanks!

-Damon