I'm having problems with dist.seqs() and very large distance files.
Running dist.seqs(), then cluster():
mothur > dist.seqs(fasta=all.fasta, cutoff=0.25)
[snip]
Output File Names:
all.dist
It took 92621 to calculate the distances for 478317 sequences.
mothur > cluster(column=all.dist, count=all.count_table)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||[ERROR]: M02149_21_000000000-A5BRY_1_210M02149_21_000000000-A5BRY_1_1113_19447_16204 is not in your count table. Please correct.
There is no sequence named M02149_21_000000000-A5BRY_1_210M02149_21_000000000-A5BRY_1_1113_19447_16204 in all.fasta or all.count_table, so the bad name is not coming from the input files.
Analyzing the problem
I believe the correct format of a column distance file is three fields per line:
sequence1 sequence2 distance
But when this bug occurs, some lines have more than three columns.
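To make that expectation explicit, here is a minimal Python sketch of a per-line check. The function names `is_valid_record` and `count_invalid` are my own and not part of mothur, and the 0 to 1 range for the distance is an assumption about what dist.seqs() writes.

```python
def is_valid_record(line):
    """Return True if the line looks like 'sequence1 sequence2 distance'."""
    fields = line.split()
    if len(fields) != 3:
        return False
    try:
        distance = float(fields[2])
    except ValueError:
        # e.g. a distance field fused with the start of another record
        return False
    # Assumption: distances are fractions between 0 and 1.
    return 0.0 <= distance <= 1.0


def count_invalid(lines):
    """Count lines failing the check; accepts any iterable of strings."""
    return sum(1 for line in lines if not is_valid_record(line))
```

Passing an open file handle, e.g. `count_invalid(open("all.dist"))`, streams the file line by line rather than loading it into memory.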
Total lines in file:
wc -l all.dist
814862058 all.dist
Lines with an incorrect number of fields:
awk 'NF != 3' < all.dist | wc -l
10
Since this is a fairly manageable number, here they are:
M02149_21_000000000-A5BRY_1_1113_19447_16204 M02149_21_000000000-A5BRY_1_210M02149_21_000000000-A5BRY_1_1113_19447_16204 M02149_21_000000000-A5BRY_1_2102_7637_17019 0.2332
M02149_21_000000000-A5BRY_1_2108_10984_7014 M02149_21_000000000-A5BRY_1_1110_23173_26202 0.M02149_21_000000000-A5BRY_1_2108_10984_7014 M02149_21_000000000-A5BRY_1_1110_23173_26202 0.2183
M02149_21_000000000-A5BRY_1_1112_24440_11685 M02149M02149_21_000000000-A5BRY_1_1112_24440_11685 M02149_21_000000000-A5BRY_1_1105_22223_12654 0.1706
M02149_21_000000000-A5BRY_1_2109_25935_18589 M02149_21_000000000-A5BRY_1_2102_22911_21181 0.23M02149_21_000000000-A5BRY_1_2109_25935_18589 M02149_21_000000000-A5BRY_1_2102_22911_21181 0.2302
M02149_21_000000000-A5BRY_1_2112_18712_7015 M02149_21_000000000-A5BRY_1_1107_28790_17029 0.2M02149_21_000000000-A5BRY_1_2112_18712_7015 M02149_21_000000000-A5BRY_1_1107_28790_17029 0.254
M02149_21_000000000-A5BRY_1_2111_28332_18652 M02149_21_00000000M02149_21_000000000-A5BRY_1_2111_28332_18652 M02149_21_000000000-A5BRY_1_2101_9294_21789 0.253
M02149_21_000000000-A5BRY_1_1102_24261_18912 M02149_21_000000000-A5BRY_1_2107_9596_M02149_21_000000000-A5BRY_1_1102_24261_18912 M02149_21_000000000-A5BRY_1_2107_9596_20127 0.189
M02149_21_000000000-A5BRY_1_1104_10577_2914 M02149_21_000000000-A5BRY_1_1105_201M02149_21_000000000-A5BRY_1_1104_10577_2914 M02149_21_000000000-A5BRY_1_1105_20159_13418 0.25
M02149_21_000000000-A5BRY_1_1107_21390_3334 M02149_21_000000000-A5BRY_1_M02149_21_000000000-A5BRY_1_1107_21390_3334 M02149_21_000000000-A5BRY_1_1102_19081_26197 0.2421
M02149_21_000000000-A5BRY_1_1111_11191_21636 M02149_21_00M02149_21_000000000-A5BRY_1_1111_11191_21636 M02149_21_000000000-A5BRY_1_1113_18554_23430 0.2292
We can see the line causing our error message is there, along with some other mangled output. I can't say whether there are also mangled lines that happen to have the correct number of fields, but it is possible.
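One way to look for mangled lines that still have three fields would be to check every name against the names in all.fasta. For instance, if a line were cut inside the first name and then followed by a full record, it would still split into three fields but start with a name that does not exist. The following is a rough Python sketch of that idea. The helper names `read_fasta_names` and `find_unknown_names` are hypothetical, and it assumes the sequence name is the first whitespace-delimited token after `>` in each FASTA header.

```python
def read_fasta_names(lines):
    """Collect sequence names from FASTA header lines ('>name ...')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def find_unknown_names(dist_lines, known_names):
    """Yield (line_number, name) for three-field lines naming an unknown sequence."""
    for number, line in enumerate(dist_lines, start=1):
        fields = line.split()
        if len(fields) != 3:
            continue  # lines with the wrong field count are found by the NF != 3 check
        for name in fields[:2]:
            if name not in known_names:
                yield number, name
```

With both files opened as handles, `list(find_unknown_names(dist_handle, read_fasta_names(fasta_handle)))` would list any suspicious three-field lines; an empty list would suggest the ten lines above are the only damaged ones.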
I've been able to reproduce this bug on multiple systems, so it doesn't appear to be transient or hardware related. The same dataset works fine if I reduce its complexity, for example with split.abund(). It's possible that outputting the distance file in phylip format would work around this issue, but it would be nice if this could be tracked down.
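As a stopgap while the cause is unknown, the ten lines listed above share a pattern that might allow them to be repaired: each one is a truncated copy of a record followed immediately by the complete record. The Python sketch below recovers the complete record under that assumption only. `repair_record` is a hypothetical helper, not a mothur feature, and it returns `None` whenever the pattern does not hold unambiguously, which includes well-formed lines.

```python
def repair_record(line):
    """Return the complete trailing record of a mangled line, else None.

    Assumes the line is '<truncated copy of record><record>', as seen in
    the ten malformed lines. Returns None if no unique split is found.
    """
    text = line.rstrip("\r\n")
    candidates = []
    # The truncated copy cannot be longer than the record itself.
    for i in range(1, len(text) // 2 + 1):
        fragment, tail = text[:i], text[i:]
        fields = tail.split()
        if not tail.startswith(fragment) or len(fields) != 3:
            continue
        try:
            float(fields[2])  # the distance field must be numeric
        except ValueError:
            continue
        candidates.append(tail)
    return candidates[0] if len(candidates) == 1 else None
```

This only makes the file loadable again; it does not explain why the partial records are written in the first place, so results from a repaired file should be treated with some caution.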