hcluster, average neighbor, large distance matrix

Hi All,

Not sure if this is a bug or not. I am trying to cluster a large distance matrix using AVERAGE neighbor…

I am using an Ubuntu box with 72GB of RAM and 8 processors to run hcluster on an 11.3GB distance file (~34k sequences) generated after all the usual steps (screen.seqs, unique.seqs, pre.cluster, chimera.uchime, etc). Briefly, here is what I did…

mothur > dist.seqs(fasta=my.data.fasta, cutoff=0.20, processors=7)
It took 1624 seconds to calculate the distances for 33798 sequences

ok, so far so good…

mothur > hcluster(column=my.data.dist, name=my.data.names, method=average, cutoff=0.20)
It took 3723 seconds to sort.

(I used the 0.20 cutoff because I wanted to make certain I got as many OTU cutoffs as possible. I ran a smaller dataset with full-length sequences using the same parameters and it stopped at 0.08.)

So it looks like mothur generated the initial .list, .rabund, and .sabund files, but so far there is only data for the “unique” cutoff. Now it seems to be continually regenerating two files: my.data.sorted.dist and my.data.sorted.dist.temp. The first file is 6.86GB; the second builds up to the same size, and then the process repeats without any addition to the list file. This has been running for ~12 hours.

Since the machine has 72GB of RAM, I also tried running cluster instead of hcluster. That process used about 40% of the memory and ran for 24 hours, but it still never got past the “unique” cutoff.

I have not encountered this issue using fn or nn, so I wonder if it has something to do with an (average neighbor)? Any suggestions?

While I am on the subject, hcluster has a “sorted” option. In what way is a matrix sorted?

Thanks!

jarrod

The hcluster command will run forever with a file that size. Since you have the RAM, cluster is the way to go.

Here’s a little background on the sorted question. The hcluster algorithm works by NOT storing all the distances in memory. Instead it creates a sorted, column-formatted distance matrix. For the furthest and nearest clustering options this method works well, since you can grab the first distance from the file and update the clusters without having to know all the rest of the distances. Some intermediate storage is used for furthest neighbor, but it’s minimal. For average neighbor, you need to read through the sorted file (my.data.sorted.dist), grabbing all the distances related to the cells you are merging, and update them, creating a new file (my.data.sorted.dist.temp). Once you are done processing my.data.sorted.dist, it is deleted and the merged file replaces it. All that reading and writing to disk is time intensive.

As to the “sorted” parameter, it is used if you already have a column-formatted file that is sorted. It is really only useful if you want to compare clustering methods: to save time, you can sort your distance file outside of mothur with a command like “sort -n -k 3 yourDistanceFile -o yourNewSortedDistanceFile”. In your case the sort took about an hour, so sorting outside of mothur could be useful if you were going to run the file with different methods.
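To make the streaming idea concrete, here is a minimal sketch (not mothur’s actual code; all names are hypothetical) of why a sorted column-format file is enough for nearest neighbor. Because the lines are ordered by ascending distance, the first line on which two clusters meet is exactly the distance at which single linkage should merge them, so one pass with a union-find structure suffices and the full matrix never has to sit in memory:

```python
def single_linkage(sorted_lines, cutoff):
    """Stream merges from lines sorted by distance.

    Each line is "seqA seqB dist" with distances ascending, like a
    sorted column-formatted distance file. Returns {seq: cluster_root}.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for line in sorted_lines:
        a, b, d = line.split()
        if float(d) > cutoff:
            break                 # file is sorted: everything after is farther
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb       # first meeting distance = merge distance
    return {s: find(s) for s in parent}

# Toy sorted column file (three columns, mothur-style).
lines = ["A B 0.01", "C D 0.02", "A C 0.15", "B D 0.30"]
labels = single_linkage(lines, cutoff=0.20)
# At cutoff 0.20 the B-D line (0.30) is never read, yet A, B, C, D
# still end up in one cluster via the A-C merge at 0.15.
```

Average neighbor breaks this pattern because a merge changes the distances of every remaining pair involving the merged cells, which is why mothur has to rewrite the sorted file (the .temp file you are seeing) after each round instead of making a single pass.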