Large dist.seqs producing corrupt files?

So, I'm having problems with dist.seqs and very large distance files.

Running dist.seqs(), then cluster()

mothur > dist.seqs(fasta=all.fasta, cutoff=0.25)

[snip]

Output File Names: 
all.dist

It took 92621 seconds to calculate the distances for 478317 sequences.

mothur > cluster(column=all.dist, count=all.count_table)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||[ERROR]: M02149_21_000000000-A5BRY_1_210M02149_21_000000000-A5BRY_1_1113_19447_16204 is not in your count table. Please correct.

There is no sequence named M02149_21_000000000-A5BRY_1_210M02149_21_000000000-A5BRY_1_1113_19447_16204 in all.fasta or all.count_table, so the problem isn't there.

Analyzing the problem
I believe the correct format of a column distance file is

sequence1 sequence2 distance

But when this bug occurs, in some cases we have more than three columns.

Total lines in file:

wc -l all.dist
814862058 all.dist

Lines with an incorrect number of fields:

awk 'NF != 3' < all.dist | wc -l
10

Since this is a fairly manageable number, here they are:

M02149_21_000000000-A5BRY_1_1113_19447_16204 M02149_21_000000000-A5BRY_1_210M02149_21_000000000-A5BRY_1_1113_19447_16204 M02149_21_000000000-A5BRY_1_2102_7637_17019 0.2332
M02149_21_000000000-A5BRY_1_2108_10984_7014 M02149_21_000000000-A5BRY_1_1110_23173_26202 0.M02149_21_000000000-A5BRY_1_2108_10984_7014 M02149_21_000000000-A5BRY_1_1110_23173_26202 0.2183
M02149_21_000000000-A5BRY_1_1112_24440_11685 M02149M02149_21_000000000-A5BRY_1_1112_24440_11685 M02149_21_000000000-A5BRY_1_1105_22223_12654 0.1706
M02149_21_000000000-A5BRY_1_2109_25935_18589 M02149_21_000000000-A5BRY_1_2102_22911_21181 0.23M02149_21_000000000-A5BRY_1_2109_25935_18589 M02149_21_000000000-A5BRY_1_2102_22911_21181 0.2302
M02149_21_000000000-A5BRY_1_2112_18712_7015 M02149_21_000000000-A5BRY_1_1107_28790_17029 0.2M02149_21_000000000-A5BRY_1_2112_18712_7015 M02149_21_000000000-A5BRY_1_1107_28790_17029 0.254
M02149_21_000000000-A5BRY_1_2111_28332_18652 M02149_21_00000000M02149_21_000000000-A5BRY_1_2111_28332_18652 M02149_21_000000000-A5BRY_1_2101_9294_21789 0.253
M02149_21_000000000-A5BRY_1_1102_24261_18912 M02149_21_000000000-A5BRY_1_2107_9596_M02149_21_000000000-A5BRY_1_1102_24261_18912 M02149_21_000000000-A5BRY_1_2107_9596_20127 0.189
M02149_21_000000000-A5BRY_1_1104_10577_2914 M02149_21_000000000-A5BRY_1_1105_201M02149_21_000000000-A5BRY_1_1104_10577_2914 M02149_21_000000000-A5BRY_1_1105_20159_13418 0.25
M02149_21_000000000-A5BRY_1_1107_21390_3334 M02149_21_000000000-A5BRY_1_M02149_21_000000000-A5BRY_1_1107_21390_3334 M02149_21_000000000-A5BRY_1_1102_19081_26197 0.2421
M02149_21_000000000-A5BRY_1_1111_11191_21636 M02149_21_00M02149_21_000000000-A5BRY_1_1111_11191_21636 M02149_21_000000000-A5BRY_1_1113_18554_23430 0.2292

We can see the line causing our error message is there, along with some other mangled output. I couldn't say whether there are also mangled lines with the correct number of fields, but it is possible.
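To catch that last case as well, here is a small sketch (my own check, not part of mothur) that flags any line of a column-format distance file whose field count is wrong, whose third field does not parse as a number, or whose distance falls outside [0, 1] — which would catch stuttered lines that still happen to have three fields:

```python
def find_bad_lines(path):
    """Return (line_number, line) pairs that are not valid 'name name distance' records."""
    bad = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            if len(fields) != 3:
                bad.append((lineno, line.rstrip("\n")))
                continue
            try:
                d = float(fields[2])
            except ValueError:
                # three fields, but the distance column is mangled
                bad.append((lineno, line.rstrip("\n")))
                continue
            if not 0.0 <= d <= 1.0:
                bad.append((lineno, line.rstrip("\n")))
    return bad
```

On an 814-million-line file this is slower than the awk one-liner, but it is the only one of the two checks that notices a corrupted distance column.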

I've been able to reproduce this bug on multiple systems, so it doesn't appear to be transient or hardware-related. The same dataset works fine if I reduce its complexity, with split.abund(), for example. It's possible that outputting the distance file in phylip format would work around this issue, but it would be nice if it could be tracked down.

I would like to add a "me too" to this bug report. I get a handful of corrupted lines in large column-formatted .dist files. As reported above, sometimes the sequence name is corrupted, and sometimes there is an incorrect number of columns. It is only noticeable in very large files (~15 corrupted lines in a 126 GB .dist file). An example of a corrupted sequence name is below. Let me know if I can provide any other information.

mothur > cluster(column=all.hydrog.filter.phylipok.trimmed.fullnames.unique.pick.dist, count=all.hydrog.filter.phylipok.trimmed.fullnames.pick.count_table)

********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||||[ERROR]: 201713246|DCO_BRZ_Bv4v5|Serp_LIG_LER20_11_117_Oct11_A2_Bv4v201713246|DCO_BRZ_Bv4v5|Serp_LIG_LER20_11_117_Oct11_A2_Bv4v5|0.5% is not in your count table. Please correct.

For what it's worth, I eventually worked around this problem by renaming all my sequences to have shorter names with a small script – names like ABC33_11022 instead of M02149_21_000000000-A5BRY_1_2112_18712_7015. Since the distance file contains the full names of sequences, long names can be the bulk of the size of the file.
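The arithmetic backs this up. With two names per line, trimming each name from 43 characters down to 11 saves tens of gigabytes in a file the size of the one above (a back-of-the-envelope sketch; the line count is from the `wc -l all.dist` output earlier in the thread):

```python
# Rough estimate of savings from shortening sequence names in a
# column-format .dist file; illustrative numbers from this thread.
lines = 814_862_058                                             # from `wc -l all.dist`
long_name = len("M02149_21_000000000-A5BRY_1_2112_18712_7015")  # 43 chars
short_name = len("ABC33_11022")                                 # 11 chars
saved_bytes = lines * 2 * (long_name - short_name)              # two names per line
print(f"~{saved_bytes / 1e9:.0f} GB saved")                     # ~52 GB
```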

I never had time to fully chase down the underlying issue, but the above workaround has prevented it from recurring and greatly cut down the size of my .dist files.

Thanks, that makes a lot of sense.

Hey Adam (or anyone),

Could you help me with the rename.seqs command? This is the command that I ran: rename.seqs(fasta=Cave_Bacteria.trim.contigs.good.unique.fasta, group=Cave_Bacteria.contigs.good.groups, name=Cave_Bacteria.trim.contigs.good.names). But I'm getting blank files. What am I missing?

rename.seqs will make your sequence names longer by appending the group names to the sequence names.

Sorry, I've been out of the field for a while, so I can't find that exact script. At the time my fastq sequences were demultiplexed by sample, so I probably just iterated through the sequences in their respective files, using the fastq sample name as a base and rewriting each sequence name as sample_name plus a running count (in hex, since it uses fewer ASCII characters).
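A hedged reconstruction of that approach (not the original script — the function name and file layout are my own assumptions, and it assumes one FASTQ file per sample):

```python
def rename_fastq(in_path, out_path, sample):
    """Rewrite FASTQ read names as SAMPLE_<hex counter>; return old->new mapping."""
    mapping = {}
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            if i % 4 == 0:                  # FASTQ header line (every 4th line)
                old = line[1:].split()[0]   # original read name, without '@'
                new = f"{sample}_{count:x}" # hex counter keeps names short
                mapping[old] = new
                fout.write(f"@{new}\n")
                count += 1
            else:
                fout.write(line)            # sequence, '+', and quality lines unchanged
    return mapping
```

You would want to keep the returned mapping (e.g. written to a TSV) so you can translate OTU tables back to the original names afterwards.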

Apologies I couldn't be of much more help. Biopython is your friend.

We are planning on adding this feature to mothur for version 1.37.0.

Dear Pat,

I'm using mothur v.1.37.3 and still encounter the same problem with every large file (it works with a set of 50,000 sequences).
Here are the details of one of the input files, the commands and the outputs. System info: Ubuntu 14.04, 64 bits, 12-processors, 31.3 GB RAM, processor IntelXeon 3.5 GHz.

Input alignment: Snow.unique.pick.good.pick.align. It was aligned with a template and is very compact. The sequences have been renamed with short names.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 187 127 0 3 1
2.5%-tile: 1 349 313 0 5 5644
25%-tile: 1 349 322 0 6 56431
Median: 1 349 322 0 6 112861
75%-tile: 1 349 323 0 6 169291
97.5%-tile: 1 349 328 0 6 220078
Maximum: 222 349 338 0 9 225721
Mean: 1.75505 348.823 321.416 0 5.93658

# of Seqs: 225,721

calculating the distance table

dist.seqs(fasta=Snow.unique.pick.good.pick.align, cutoff=0.20, processors=12)
Took 4-5 hours.
Snow.unique.pick.good.pick.dist 120,303,341 KB (~120 GB)

making a count_table for the clustering

Snow.pick.pick.pick.names 225,721 seqs
Snow.pick.pick.pick.groups 439,402 seqs
count.seqs(name=Snow.pick.pick.pick.names, group=Snow.pick.pick.pick.groups)
Snow.pick.pick.pick.count_table 439,402 seqs

clustering

cluster(column=Snow.unique.pick.good.pick.dist, count=Snow.pick.pick.pick.count_table)
[ERROR]: 14235_PAR2_NG14235_PAR2_NGP is not in your count table. Please correct.

-> The actual sequence name is 14235_PAR2_NGP; the dist table "stutters".

Thanking you in advance,

AM Fiore-Donno

How big is the distance file? What file system is your disk formatted with? Have you tried it with processors=1?

Hi Sarah and Pat,

I have got a similar problem with cluster.split: on the first run, it gives errors like those above. Based on your comments, I realize it may be due to low memory.

It is a large data set: the total sequence number is about 38 million, with 1.4 million unique sequences. I have since switched to a system with 64 cores and 500 GB of memory, running the latest mothur version with 30 processors, and the run has lasted for more than 3 days at the cluster.split step and still hasn't finished. Do you think this is normal, or may something have gone wrong (there were no errors in the previous steps)?

Any suggestions if we plan to run more samples together? I think only running all samples together through the cluster step allows comparison among samples at the OTU level - is that right?

Thanks,
Junnie

I suspect you have a data quality problem:

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/

You might try diffs=3 in pre.cluster, but you may be limited to using a phylotype-based approach.

Pat