I'm not sure whether this is a bug. I am trying to cluster a large distance matrix using AVERAGE neighbor…
I am using an Ubuntu box with 72GB of RAM and 8 processors to run hcluster on an 11.3GB distance file (~34k sequences) generated after all the usual steps (screen, unique, pre.cluster, chimera.uchime, etc.). Briefly, here is what I did…
mothur > dist.seqs(fasta=my.data.fasta, cutoff=0.20, processors=7)
It took 1624 seconds to calculate the distances for 33798 sequences
ok, so far so good…
mothur > hcluster(column=my.data.dist, name=my.data.names, method=average, cutoff=0.20)
It took 3723 seconds to sort.
(I used the 0.20 cutoff because I wanted to make certain I got as many OTU cutoffs as possible. I ran a smaller dataset with full-length sequences using the same parameters and it stopped at 0.08.)
So it looks like mothur generated the initial .list, .rabund, and .sabund files, but so far they only contain data for the "unique" cutoff. Now it seems to be continually regenerating two files:
my.data.sorted.dist and my.data.sorted.dist.temp. The first file is 6.86GB; the second builds up to the same size, and then the process repeats with nothing being added to the list file. This has been running for ~12 hours.
Since the machine has 72GB of RAM, I also tried running cluster instead of hcluster. That process used about 40% of the memory and ran for 24 hours, but it still never got past the "unique" cutoff.
I have not encountered this issue using furthest (fn) or nearest (nn) neighbor, so I wonder if it has something to do with average neighbor (an)? Any suggestions?
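For context on why average neighbor might behave so differently from fn/nn: with average linkage, the distance between two clusters is the mean of all cross-pair distances, so every merge changes the distances to every other cluster. This is not mothur's implementation, just a naive Python sketch of average-neighbor (UPGMA-style) clustering on a made-up toy matrix, to illustrate the bookkeeping involved:

```python
import itertools

def avg_dist(a, b, dist):
    """Average of all pairwise distances between members of clusters a and b."""
    pairs = [dist[frozenset((x, y))] for x in a for y in b]
    return sum(pairs) / len(pairs)

def average_cluster(names, dist, cutoff):
    """Naive average linkage: repeatedly merge the closest pair of clusters
    until the smallest average distance exceeds the cutoff.
    names: sequence ids; dist: {frozenset({i, j}): distance}."""
    clusters = [frozenset([n]) for n in names]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        (i, j), d = min(
            ((pair, avg_dist(clusters[pair[0]], clusters[pair[1]], dist))
             for pair in itertools.combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if d > cutoff:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [set(c) for c in clusters]

# Hypothetical toy data: "a" and "b" are close, "c" is distant.
toy = {frozenset(("a", "b")): 0.01,
       frozenset(("a", "c")): 0.50,
       frozenset(("b", "c")): 0.50}
print(average_cluster(["a", "b", "c"], toy, 0.1))  # a+b merge; c stays alone
```

On ~34k sequences this naive loop would be hopeless, which is presumably why the real implementations sort or stream the matrix rather than rescanning all pairs after each merge.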
While I am on the subject, hcluster has a “sorted” option. In what way is a matrix sorted?