cluster and cluster.split

Hi all,

I am trying to cluster a large dataset to assign OTUs and have been unsuccessful. I have tried both cluster and cluster.split, and I have tried using cutoffs, including when running the dist.seqs command. I have been able to cluster the data when it is broken up into thirds, so I suspect it might be a RAM issue, but I don’t know for sure. There are 71545 unique sequences in my dataset, and the specs for the two machines I have tried so far are 32 GB of RAM with a quad-core processor and 64 GB with 24 cores. Does anybody have any thoughts on whether these are sufficient amounts of RAM to cluster this many sequences, and/or any other ideas or suggestions? Thanks! Tony

Sounds like a lot of unique sequences. Can you post the commands you are running from raw data to your attempts to cluster?

Thanks for the reply, Pat, and sorry I haven’t checked back sooner. I obtained the data from 13 runs on a GS Jr. What follows is an example of the command progression from one of the runs:

sffinfo(sff=pfizer12.sff,flow=t,trim=t,fasta=t)
trim.flows(flow=pfizer12.flow,oligos=pfizer12.oligos.txt,pdiffs=2,bdiffs=0,fasta=T,minflows=450,maxflows=450)
shhh.flows(file=pfizer12.flow.files)
trim.seqs(fasta=pfizer12.shhh.fasta,name=pfizer12.shhh.names,oligos=pfizer12.oligos.txt,pdiffs=1,bdiffs=0,maxhomop=6,minlength=250,flip=T)
unique.seqs(fasta=pfizer12.shhh.trim.fasta,name=pfizer12.shhh.trim.names)
align.seqs(fasta=pfizer12.shhh.trim.unique.fasta,reference=silva.all.April2013.fasta,flip=t)
screen.seqs(fasta=pfizer12.shhh.trim.unique.align,name=pfizer12.shhh.trim.unique.names,group=pfizer12.shhh.groups,minlength=250,end=41790)
filter.seqs(fasta=pfizer12.shhh.trim.unique.good.align,vertical=T,trump=.)
unique.seqs(fasta=pfizer12.shhh.trim.unique.good.filter.fasta,name=pfizer12.shhh.trim.unique.good.names)
pre.cluster(fasta=pfizer12.shhh.trim.unique.good.filter.unique.fasta,name=pfizer12.shhh.trim.unique.good.filter.names,group=pfizer12.shhh.good.groups,diffs=2)
chimera.uchime(fasta=pfizer12.shhh.trim.unique.good.filter.unique.precluster.fasta,name=pfizer12.shhh.trim.unique.good.filter.unique.precluster.names,group=pfizer12.shhh.good.groups)
remove.seqs(accnos=pfizer12.shhh.trim.unique.good.filter.unique.precluster.uchime.accnos,fasta=pfizer12.shhh.trim.unique.good.filter.unique.precluster.fasta,name=pfizer12.shhh.trim.unique.good.filter.unique.precluster.names,group=pfizer12.shhh.good.groups)
classify.seqs(fasta=pfizer12.shhh.trim.unique.good.filter.unique.precluster.pick.fasta,template=nogap.all.April2013.fasta,taxonomy=silva.all.April2013.tax,cutoff=60)
remove.lineage(fasta=pfizer12.shhh.trim.unique.good.filter.unique.precluster.pick.fasta,name=pfizer12.shhh.trim.unique.good.filter.unique.precluster.pick.names,group=pfizer12.shhh.good.pick.groups,taxonomy=pfizer12.shhh.trim.unique.good.filter.unique.precluster.pick.April2013.wang.taxonomy,taxon=Bacteria;Cyanobacteria;-Eukaryota;)
system(copy pfizer12.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta pfizer12.final.fasta)
system(copy pfizer12.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names pfizer12.final.names)
system(copy pfizer12.shhh.good.pick.pick.groups pfizer12.final.groups)
dist.seqs(fasta=pfizer12.final.fasta,output=lt)
cluster(phylip=pfizer12.final.phylip.dist,name=pfizer12.final.names,method=nearest,cutoff=0.25)

To combine the data from the individual runs, I used the merge.files command on the fasta, name, and group files after the trim.seqs step, then ran unique.seqs on the merged data and continued through the same progression, which is where I have run into difficulties with clustering.
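In case it is useful, the merging step looked roughly like this (only three of the 13 runs shown, and the run and output names here are just placeholders):

merge.files(input=pfizer1.shhh.trim.fasta-pfizer2.shhh.trim.fasta-pfizer3.shhh.trim.fasta, output=combined.trim.fasta)
merge.files(input=pfizer1.shhh.trim.names-pfizer2.shhh.trim.names-pfizer3.shhh.trim.names, output=combined.trim.names)
merge.files(input=pfizer1.shhh.groups-pfizer2.shhh.groups-pfizer3.shhh.groups, output=combined.groups)
unique.seqs(fasta=combined.trim.fasta, name=combined.trim.names)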

Tony

It looks like you are following the Schloss SOP, which includes steps to reduce the amount of memory needed. If you are able to cluster the data in thirds, then I would suspect it’s a RAM issue. Here is a link to some RAM estimates for cluster: http://www.mothur.org/wiki/Cluster_stats. How big is the dist file you are running with cluster? You might try using the cluster.split command. If you use cluster.split to cluster by taxonomy, set processors=1; the more processors you use, the more RAM is required.
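For example, splitting by taxonomy would look something like this (a sketch only: the file names and the taxlevel/cutoff values are placeholders, and pfizer12.final.taxonomy stands in for whatever taxonomy file you carried forward from classify.seqs and remove.lineage):

cluster.split(fasta=pfizer12.final.fasta, name=pfizer12.final.names, taxonomy=pfizer12.final.taxonomy, splitmethod=classify, taxlevel=3, cutoff=0.15, processors=1)

This bins the sequences at the given taxonomic level and clusters each bin separately, so no single distance matrix has to span the whole dataset.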

The dist file is a little over 17 MB with the output set to lower triangle. I tried running cluster.split on my box with 32 GB of RAM, but the program crashed with an error indicating there wasn’t enough memory. I am not sure whether the conversion to the column-formatted matrix was successful before the program terminated, but if it was, the dist file it generated is about 93 MB. I also ran the cluster.split command on a server with 64 GB of RAM, and it finished, but the output made no sense (there were almost as many OTUs as sequences). We discovered later that that particular computer was also having memory issues. I guess I am mostly wondering whether we need to find a machine with more memory or upgrade ours.

I am also a little suspicious that there are so many unique sequences when I combine the data, considering that there are only between 23,000 and 30,000 for each of the datasets when I analyzed them in thirds. The data are all from similar types of samples, so I don’t understand why the total is roughly additive when there should definitely be a lot of overlap among the groups. Does anybody have any idea why I am not seeing more sequences drop out when they are processed through the pipeline together? Thanks

Just to clarify what I posted earlier: obviously we are having memory issues, but what I meant to ask was whether this makes sense given the number of sequences we are attempting to cluster. On the cluster stats page, the dist files don’t seem that much smaller than the one we are working with, yet they contain far fewer sequences. Is this likely because the conversion to the column-formatted file during the cluster.split command crashed before it was complete? I will attempt to run the dist.seqs command again without the output specified as lt.

Did you mean 17 MB or 17 GB? With 71545 sequences going into dist.seqs and cluster, I would expect a very large distance matrix unless most of the sequences are above the cutoff. 32 GB of RAM should be enough to cluster a 17 MB distance matrix, so perhaps there is something else going on. Do you want to send your distance matrix and names file to mothur.bugs@gmail.com?
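As a rough sanity check on the sizes: a complete lower-triangle matrix for 71545 sequences holds 71545 × 71544 / 2 ≈ 2.56 billion distances. At a very rough 7 characters per entry in the text file, that is on the order of 17 GB, not 17 MB, so a 17 MB file would suggest the matrix was never fully written.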

I reran the dist.seqs command without the output set as lt and with cutoff=0.10. It took 16715 seconds to calculate the distances for the 71545 sequences, and the resulting dist file is 108 MB. I am currently attempting to run the cluster.split command again using the newly generated dist file as column input. Do you think it is still worth sending my dist file and names file to the mothur.bugs email, or should I wait to see whether the cluster.split command runs? Thanks for all of the advice!
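For reference, the rerun looks like this (using the pfizer12 names from my earlier post as stand-ins; the merged files are named differently but the form is the same):

dist.seqs(fasta=pfizer12.final.fasta, cutoff=0.10)
cluster.split(column=pfizer12.final.dist, name=pfizer12.final.names, cutoff=0.10, processors=1)

As I understand it, leaving output unspecified makes dist.seqs write a column-formatted file that only keeps pairs at or below the cutoff, which is why this file is so much smaller than a full matrix would be.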

Hopefully everything is working for you now, but if not feel free to send your files. I am happy to help.