Segmentation fault when clustering a 1.44 GB dist file

This is the command I used:
mothur "#read.dist(column=seq.nr.aligned.dist, name=seq.names); cluster()"

Mothur was actually able to cluster at unique and 0.01, but not any further.

I have another 4GB+ dist file on the way, so if this doesn’t work I’m in a bit of trouble.

Is there any hope at all or am I hitting a hardware limit? Thanks.

How much RAM does your computer have? Are you already using a cutoff?

Also, v.1.7.0 will have an option to use an alternate form of the clustering algorithm, which is light on memory requirements. While it is faster for big files, it tends to be slower on small files.
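
Invoking the new option should look something like the line below. The command name and parameters are a sketch of the planned interface, so treat them as provisional until 1.7.0 is actually out:

mothur "#hcluster(column=seq.nr.aligned.dist, name=seq.names, cutoff=0.20)"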

I actually ran it on a compute cluster and allocated it 3.8 GB of RAM, which is the maximum for that node. I used a cutoff of 0.5 when I generated the distance matrix.

I think this is probably a hardware issue more than anything else. I may have access to a machine with 128 GB of RAM, so that will probably handle this particular distance matrix. Having said that, I now have another distance matrix file that is 45 GB, and using the new algorithm in 1.7.0 may be the only solution for that.

You’re right that it is probably a hardware problem and I suspect you’ll be well served by the new feature in version 1.7. As an aside, I’m not sure what you gain by going out to 0.5 except for more distances that don’t correlate very well. I think the largest distance I’ve seen analyzed with full-length sequences is 0.20.

That’s correct, although we do sometimes look at 0.30 to get some idea of phylum-level diversity. I wasn’t expecting the distance file to be so big, so I didn’t change the shell script used to run Mothur, which was set to a 0.50 cutoff.
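
For reference, the relevant line in that script is essentially the one below, and I’ll lower the cutoff for the next run. The fasta file name here is just my guess based on the name of the dist file, so adjust it to whatever actually gets passed to dist.seqs:

mothur "#dist.seqs(fasta=seq.nr.aligned.fasta, cutoff=0.30)"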

On a somewhat related topic, can you elaborate a little on how unique.seqs works? Do two sequences have to be the same length to be considered identical? I don’t think there’s that much diversity in the samples, and if we can play with unique.seqs a little, it may make downstream analyses less computationally intensive.

Thanks.

The sequences have to be identical in every way. One thing you can do is to run unique.seqs twice. For example, run unique.seqs on the raw reads and then align the uniques. Then screen and filter the alignment so that the sequences overlap on the same region (use the trump=. option). Then do unique.seqs again (be sure to include the current names and group file). This should cut things down significantly.
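
To make that concrete, the steps would look roughly like the sketch below. The file names are placeholders (yours will differ), the reference alignment is whatever template you normally align against, and the parameter names follow the current documentation and may differ slightly between versions, so check what each step actually writes out before running the next one:

mothur "#unique.seqs(fasta=seq.fasta)"
mothur "#align.seqs(fasta=seq.unique.fasta, reference=core_set_aligned.fasta)"
mothur "#screen.seqs(fasta=seq.unique.align, name=seq.names, group=seq.groups, optimize=start-end, criteria=95)"
mothur "#filter.seqs(fasta=seq.unique.good.align, vertical=T, trump=.)"
mothur "#unique.seqs(fasta=seq.unique.good.filter.fasta, name=seq.good.names)"

The second unique.seqs pass pays off because, once trump=. has trimmed everything down to the shared overlapping region, sequences that only differed outside that region collapse into a single unique sequence.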