Segmentation fault when clustering a 1.44 GB dist file

This is the command I used:
mothur "#read.dist(column=seq.nr.aligned.dist, name=seq.names); cluster()"

Mothur was actually able to cluster at unique and 0.01, but not any further.

I have another 4GB+ dist file on the way, so if this doesn’t work I’m in a bit of trouble.

Is there any hope at all or am I hitting a hardware limit? Thanks.

How much RAM does your computer have? Are you already using a cutoff?

Also, v.1.7.0 will have an option to use an alternate form of the clustering algorithm, which is light on memory requirements. While it is faster for big files, it tends to be slower on small files.
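
Invoking the new option should look something like the line below. The command name and parameters are a sketch of the planned interface, so treat them as provisional until 1.7.0 is actually out:

mothur "#hcluster(column=seq.nr.aligned.dist, name=seq.names, cutoff=0.20)"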

I actually ran it on a compute cluster and allocated it 3.8 GB of RAM, which is the maximum for that node. I used a cutoff of 0.5 when I generated the distance matrix.

I think this is probably a hardware issue more than anything else. I may have access to a machine with 128 GB of RAM, so that will probably handle this particular distance matrix. Having said that, I now have another distance matrix file that is 45 GB, and using the new algorithm in 1.7.0 may be the only solution for that.

You’re right that it is probably a hardware problem and I suspect you’ll be well served by the new feature in version 1.7. As an aside, I’m not sure what you gain by going out to 0.5 except for more distances that don’t correlate very well. I think the largest distance I’ve seen analyzed with full-length sequences is 0.20.

That’s correct, although we do sometimes look at 0.30 to get some idea of phylum-level diversity. I wasn’t expecting the distance file to be so big, so I didn’t change the shell script used to run Mothur, which was set to a 0.50 cutoff.
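
For reference, the relevant line in that script is essentially the one below, and I’ll lower the cutoff for the next run. The fasta file name here is just my guess based on the name of the dist file, so adjust it to whatever actually gets passed to dist.seqs:

mothur "#dist.seqs(fasta=seq.nr.aligned.fasta, cutoff=0.30)"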

On a somewhat related topic, can you elaborate a little on how unique.seqs works? Do two sequences have to be the same length to be considered identical? I don’t think there’s that much diversity in the samples, and if we can play with unique.seqs a little, it may make downstream analyses less computationally intensive.

Thanks.

The sequences have to be identical in every way. One thing you can do is to run unique.seqs twice. For example, run unique.seqs on the raw reads and then align the uniques. Then screen and filter the alignment so that the sequences overlap on the same region (use the trump=. option). Then do unique.seqs again (be sure to include the current names and group file). This should cut things down significantly.
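
To make that concrete, the steps would look roughly like the sketch below. The file names are placeholders (yours will differ), the reference alignment is whatever template you normally align against, and the parameter names follow the current documentation and may differ slightly between versions, so check what each step actually writes out before running the next one:

mothur "#unique.seqs(fasta=seq.fasta)"
mothur "#align.seqs(fasta=seq.unique.fasta, reference=core_set_aligned.fasta)"
mothur "#screen.seqs(fasta=seq.unique.align, name=seq.names, group=seq.groups, optimize=start-end, criteria=95)"
mothur "#filter.seqs(fasta=seq.unique.good.align, vertical=T, trump=.)"
mothur "#unique.seqs(fasta=seq.unique.good.filter.fasta, name=seq.good.names)"

The second unique.seqs pass pays off because, once trump=. has trimmed everything down to the shared overlapping region, sequences that only differed outside that region collapse into a single unique sequence.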