I ran hcluster on our large distance matrix (85699 seconds!), but I see some odd cutoff values once the distances start to get large. We’re not sure yet whether these are meaningful for our analysis, but I wanted to bring it to your attention. When I read the list file created by the cluster (with associated groups file) using
I get the following output:
Is this an error in the code, or should I expect multiple zeros like this? It continues in this way until all distances are found. Also, what about the duplicates at 2.56? Is this because I’m using reads that are unique across all samples?
Actually, my bigger concern is how you got distances >0.30… I suspect you have a number of sequences that don’t overlap and therefore give large distances. To check this, you should run summary.seqs on the alignment file that you ran dist.seqs on. You’ll probably find a number of sequences that end before most sequences start, and vice versa.
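If it helps to see the overlap check spelled out, here is a rough Python sketch of the idea. It is not what summary.seqs does internally, just an illustration: it assumes the alignment is already loaded as a dict of name to aligned string, that `-` and `.` are the gap characters, and the helper names `span` and `non_overlapping_pairs` are made up for this example.

```python
from itertools import combinations

# Assumption: '-' and '.' mark gap/padding columns in the aligned sequences.
GAP_CHARS = set("-.")


def span(aligned):
    """Return (start, end) columns of the first and last non-gap character.

    Returns None if the sequence is all gaps.
    """
    cols = [i for i, ch in enumerate(aligned) if ch not in GAP_CHARS]
    if not cols:
        return None
    return cols[0], cols[-1]


def non_overlapping_pairs(alignment):
    """List pairs of sequence names whose aligned spans share no columns.

    `alignment` is a dict mapping sequence name -> aligned string
    (hypothetical input format for this sketch).
    """
    spans = {name: span(seq) for name, seq in alignment.items()}
    pairs = []
    for a, b in combinations(sorted(spans), 2):
        sa, sb = spans[a], spans[b]
        if sa is None or sb is None:
            continue  # skip all-gap sequences
        # No overlap if one sequence ends before the other starts.
        if sa[1] < sb[0] or sb[1] < sa[0]:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Toy alignment: seqA ends at column 5, seqB starts at column 6.
    toy = {
        "seqA": "ACGTAC------",
        "seqB": "------GGTTAA",
        "seqC": "--GTACGGTT--",
    }
    print(non_overlapping_pairs(toy))
```

Comparing every pair is quadratic, so on a data set of this size you would only want to look at the distribution of start and end positions (which is what the summary.seqs output gives you) rather than enumerate pairs.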
Regardless, the bug is weird. Is there any way that you could either post the distance matrix online or email it to us? Alternatively, you could email us (firstname.lastname@example.org) the sequences and the list of commands you ran to get to the hcluster step. This is a new command, so it’s entirely possible that the kinks aren’t all ironed out yet.
Distances larger than 0.3 are a result of using my own matrix of paralinear distances (I posted something similar to the commands board) instead of a %-based distance metric like those from dist.seqs. I’ll check about sending the matrix; it’s pretty big (>5G).