dist.seqs bug?

wmnwmn · July 20, 2010, 8:24pm

On a large alignment file generated by Mafft, I am seeing that dist.seqs returns a minimum distance 2.6%, corresponding to two differences on the sequences (which are all 75bp). However, there are definitely some that differ only by 1 difference:

Example:

test.seq =

BL021343
AAAACTTAAATGGGAATTGACAGCTTTAGAGATAGAGCTTTCTTCGGACAATTTTCAAGGTGCTGCATGGTTGTC
B016151
AAAACTTAAATGGGAATTGACAGCTTTAGAAATAGAGCTTTCTTCGGACAATTTTCAAGGTGCTGCATGGTTGTC

The Mafft alignment is pasted below, which has 1 difference when examined in an alignment viewer (JALView).

Mothur logfile (v.1.11.0) =
mothur > dist.seqs(fasta=test.align,cutoff=0.20)
0 0
1 0

Output File Name:
test.dist

It took 0 to calculate the distances for 2 sequences.

mothur > quit

test.dist=
BL021343 B016151 0.02632

------------------------------------------------------------------------
test.align= >B016151 -------------------------------a-----a-----a-------a-------- -----------------------c-----------t-----------t------------ ------a----------------------------------------------------- ------------------a----a---------------t-------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------g---------------------------------------g----------- ---------------------------------------------g-------------- ---------------------------------------------------------a-- -------------------------------------------------at-t------- g-------a-c--------------------------------a---------------- ------------------------------------------------------------ ------------g---------------------------------------c------- ----------------------------------------t------------------- --------t--------------------------------------------------- ------------------------t----------------------------------- ------------------------------------------------------------ -------------a-----------------------------g---------------- ----------------------------------------------a------------a ----------------------a------------------------------------- ---------------t-------------------------------------------- -----------------------------------------a------------------ -------------------g---------------------------------------- ------------------------------------------------------------ ----a------------------------------------------------------- ----------------------------------------------------g------- ------------------------------------------------------------ ---------------------------------------------c-------------- ------------------------------------------------------------ ------t----------------------------------------------------- ------------------------------------------------------------ -----------------------------t------------------------------ -----------------------------------------t--------------c--- ------------------------------------------------------------ ------------------t----------------------------------------- t----------------------------------------------------------- ------------------------------------c----------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------------------------------g--------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -g---------------------------------------------------------- ------------------------------------------------------a----- ------------------------------------------------------------ ------------------------------------------------------------ ---------------------------------c-------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -----------------------------------a------------------------ ----------------------a------------------------------t------ -----t----------------------------------------------------t- t------------------c---------------------------------------- -----a------------------------------------------------------ -----------------------------a----g----g---t-------g--c---t- -g-----------c------a--------------t-----g------------------ -----------------g---------------------------------t-------- ---------------------t----------------------------g--------- -----------------t----------------------------c------------- ----------------------------------------------------------- >BL021343 -------------------------------a-----a-----a-------a-------- -----------------------c-----------t-----------t------------ ------a----------------------------------------------------- ------------------a----a---------------t-------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------g---------------------------------------g----------- ---------------------------------------------g-------------- ---------------------------------------------------------a-- -------------------------------------------------at-t------- g-------a-c--------------------------------a---------------- ------------------------------------------------------------ ------------g---------------------------------------c------- ----------------------------------------t------------------- --------t--------------------------------------------------- ------------------------t----------------------------------- ------------------------------------------------------------ -------------a-----------------------------g---------------- ----------------------------------------------a------------g ----------------------a------------------------------------- ---------------t-------------------------------------------- -----------------------------------------a------------------ -------------------g---------------------------------------- ------------------------------------------------------------ ----a------------------------------------------------------- ----------------------------------------------------g------- ------------------------------------------------------------ ---------------------------------------------c-------------- ------------------------------------------------------------ ------t----------------------------------------------------- ------------------------------------------------------------ -----------------------------t------------------------------ -----------------------------------------t--------------c--- ------------------------------------------------------------ ------------------t----------------------------------------- t----------------------------------------------------------- ------------------------------------c----------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------------------------------g--------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -g---------------------------------------------------------- ------------------------------------------------------a----- ------------------------------------------------------------ ------------------------------------------------------------ ---------------------------------c-------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -----------------------------------a------------------------ ----------------------a------------------------------t------ -----t----------------------------------------------------t- t------------------c---------------------------------------- -----a------------------------------------------------------ -----------------------------a----g----g---t-------g--c---t- -g-----------c------a--------------t-----g------------------ -----------------g---------------------------------t-------- ---------------------t----------------------------g--------- -----------------t----------------------------c------------- -----------------------------------------------------------

pschloss · July 21, 2010, 4:41pm

Crap, you found a bad bug. This only matters if you have sequences that don’t overlap over the full length of the sequence. If you run filter.seqs(trump=.) this isn’t a problem. However, it is a problem, which will be fixed in the release that will be put out on Friday. Feel free to email us at mothur.bugs@gmail.com if you want the source file ahead of then. I really feel bad about these types of bugs and hope they don’t affect anyone. Again, if you use filter.seqs(trump=.) like we suggest, this isn’t a problem - alternatively, unless you are still using GS20 data it will probably inflate distances by 0.005 to 0.002.

wmnwmn · July 22, 2010, 4:25pm

I think, in this case, that filter.seqs wouldn’t have helped, because the alignment file came from Mafft, with no reference. I would hope that Mafft isn’t
putting in gratuitous columns that are just all gaps!

What must have happened is that one of the 42k (unique) sequences had some junk at the end, giving rise to terminal
gaps in every alignment pair. That’s why I saw no alignment pairs with less than 2 differences, which is what made
me take notice.

pschloss · July 22, 2010, 7:44pm

Thanks again for catching this - it’s been in there from the beginning. If this is the output from mafft, then I would be worried about the quality of the alignment. Sequences with 75 bases shouldn’t have such long alignments. Assuming these are 16S, I suspect that if you aligned them with our aligner to the reference and then did filter(trump=T, vertical=.) the alignment wouldn’t be longer than 200 bp. There must be a few really bad sequences in there - sometimes we find sequences don’t actually align to the region that was sequenced.

umberar7 · August 6, 2010, 8:41pm

hi,
I think there may be another bug in dist.seqs or read.dist…

mothur > read.dist(column=mothur4.final.dist, name=mothur4.final.names)

********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||AAError: Sequence ‘FecalT1-3_F1WMYZV04IO9IrumenT1-5_F1WMYZV03F2PX8’ was not found in the names file, please correct

It seems as it's not reading the names file properly.

Any thought?
Thanks

pschloss · August 9, 2010, 7:43pm

Can you send your names file to us at mothur@bugs.gmail.com?

Also, double check that you have a sequence named “FecalT1-3_F1WMYZV04IO9IrumenT1-5_F1WMYZV03F2PX8” in the first column of your names file.

amcomeau · September 30, 2010, 7:02pm

Pat,
I also just came across the same bug - I ran a large column distance matrix (135,961 seqs = 15 Gb) through and got:

mothur > hcluster(column=AMCE_E.alignT.dist, name=AMCE_E.alignT.names, cutoff=0.03)
It took 5423 seconds to sort.
AAError: Sequence ‘GNV8VFP05F00AD’ was not found in the names file, please correct

Although the names file has all 135,961 seqs that are in the fasta/dist file/groups file, etc…everything matches.

However, why do you say “double check that you have a sequence named “X” in the first column of your names file”? My names file is not 135,961 seqs long in the first column, but only 120,371 - but I thought that was normal since the names file was not supposed to be redundant - i.e. when there were identical seqs (found using unique.seqs to generate the names file in the first place) that the names file would group them together in the second column, thereby reducing the number of total lines and therefore you would have some seqs that would not appear in the first column…am I missing something?

Thanks!

pschloss · September 30, 2010, 8:56pm

Any of the sequence names in the distance matrix should be found in the first column of the names file since those are the unique sequence names. I don’t recall ever hearing back from umberar7, so I’m not sure what the bug was. If you can double check that the problem sequence is in the first column of the names file that’d be great. Often times people forget to update the names file when running unique.seqs twice, when using screen.seqs, precluster.seqs or remove.seqs.

amcomeau · September 30, 2010, 9:06pm

Hi again Pat,
The problem sequence (maybe just the first, since dist.seqs must stop at the first error - no?) is here in the names files:

Column 1 Column 2
…
GNV8VFP05F9HHU GNV8VFP05F9HHU,GNV8VFP05GC5YE,GNV8VFP05F00AD
…

So, it’s in the second column and doesn’t show up in the first column. As I mentioned above, there are plenty of seqs missing from the first column (only 120,371 lines for 135,961 seqs), but when I import the second column into Excel and parse it using the comma separator to see how many seqs are listed/represented, it’s all 135,961 of them…is this not how the format is supposed to work?

FYI: I simply ran my aligned file through unique.seqs in order to get the names file to use column-format in the downstream analyses and went directly into dist.seqs (ie: I did not run any other commands in between).

umberar7 · September 30, 2010, 9:25pm

Hi Pat,
we did exchange quite a few emails at that time and it was determined that there was a problem with the MPI version when running dist.seqs. For some reason, it concatenated the sequence names hence why the sequence names weren’t found. I haven’t used dist.seqs with the newest version of mothur so I’m not sure if this problem was fixed in the latest release.

Thanks

amcomeau · October 1, 2010, 5:19pm

Pat and umberar7,
I currently have version 1.10.2 compiled on our server - has this dist.seqs bug been fixed since then in the newer version (give me yet anothrer reason to recompile)?

Otherwise I guess my only option is to redo the matrix in the larger phylip format in order to continue…

Thanks.

pschloss · October 4, 2010, 12:10pm

Sorry for the oversight, umberar7.

pschloss · October 4, 2010, 1:19pm

janau18 - Are you certain that you get a distance of 0 with v1.13? You should be getting a distance of 1. If you are using the most recent version, please email us (mothur.bugs@gmail.com) two sequences where you’re having this weird behavior.

Thanks,
Pat

pschloss · October 4, 2010, 1:26pm

amcomeau…

I currently have version 1.10.2 compiled on our server - has this dist.seqs bug been fixed since then in the newer version (give me yet anothrer reason to recompile)?

Otherwise I guess my only option is to redo the matrix in the larger phylip format in order to continue…

Yes, you should get the latest version of mothur; however, it probably won’t do anything for read.dist issue. I’m still concerned that you are perhaps putting different sequences into MAFFT than you should be. Is the file input to MAFFT something ending in *unique.fasta? As I understand it you are doing…

align.seqs on *.fasta → *.align
dist.seqs on *mafft.align → *.dist
unique.seqs on *.align → *.unique.fasta & *.names
read.dist on *dist & *names → error

what you should be doing is something like either…

align.seqs on *.fasta → *.align
unique.seqs on *.align → *.unique.fasta & *.names
dist.seqs on *.unique.fasta → *.unique.dist
read.dist on *.unique.dist & *.names

or…

unique.seqs on *fasta → *unique.fasta & *names
align.seqs on *.unique.fasta → *unique.align
dist.seqs on *unique.mafft.fasta → *unique.dist
read.dist on *unique.dist & *.names

If I’m misinterpreting what you’re doing, please give us an explicit blow by blow of what you’re doing…
Pat

amcomeau · October 4, 2010, 4:59pm

Pat,
What I did was the following (not sure why you’re talking about MAFFT - is that what you’re calling the align.seqs command routine?) - I started with the unaligned FASTA of all my tags:

align.seqs(candidate=X.fas, template=silva.eukarya.fasta, ksize=9, processors=4)
filter.seqs(fasta=X.align.fas)

I then manually examined the alignment and trimmed it (removing some seqs, finishing with 135,961), generating X.alignT.fas. I then realised that I needed a names file if I wanted to run hcluster in column format:

unique.seqs(fasta=X.alignT.fas)

I did not then “re-trim” the “reduced” alignment since identical seqs are removed and should not, by definition, change the alignment (not introduce nor remove any gaps). I took the original FASTA file (X.alignT.fas not X.alignT.unique.fas) and the generated names file and ran:

dist.seqs(fasta=X.alignT.fas, cutoff=0.03, processors=4)
read.dist(column=X.alignT.dist)

Which worked fine…then the error-generating command:

hcluster(column=X.alignT.dist, name=X.alignT.names, cutoff=0.03)
It took 5423 seconds to sort.
AAError: Sequence ‘GNV8VFP05F00AD’ was not found in the names file, please correct

I assume the problem was not using the unique.fas file instead? hcluster must find lines not referred to in the names file because of this?

I’m not sure why I didn’t use the unique.fas file instead at that point - perhaps I’m not sure these data will be accounted for later on in all the downstream analyses (you can maybe reassure me here)…I want those identical seqs counted in all the commands afterwards for richness and evenness calculation (summary.single), for the seq labels in the FASTA file I get from a get.oturep(sorted=size) command, for example (in order to identify the dominant OTUs representing >1% of all the seqs)…

pschloss · October 4, 2010, 5:22pm

amcomeau,

Sorry, there’s about 4 or 5 threads here and I confused myself… Here’s what you should be doing…

align.seqs(candidate=X.fas, template=silva.eukarya.fasta, ksize=9, processors=4)
filter.seqs(fasta=X.align)
unique.seqs(fasta=X.alignT.filter.fasta)   #note the filename change
dist.seqs(fasta=X.alignT.filter.unique.fasta, cutoff=0.03, processors=4)  #no the alignment hasn't changed, but the number of sequences has
read.dist(column=X.alignT.filter.unique.dist, name=X.alignT.filter.names)

A couple of comments…

I did not then “re-trim” the “reduced” alignment since identical seqs are removed and should not, by definition, change the alignment (not introduce nor remove any gaps). I took the original FASTA file (X.alignT.fas not X.alignT.unique.fas) and the generated names file and ran:

You run unique.seqs to “remove” the redundant sequences and so if you only calculate distances using these unique sequences, you can only read those in with the help of the name file. The way you’ve run it dist.seqs, you haven’t used the filtered nor the uniq’d sequences. You must provide a name file if you use column in read.dist. The way you’ve run read.dist, your name file would be two identical columns because you haven’t reduced the complexity of the dataset. Then when you run hcluster, you are reading in the same matrix again, but you’re giving it a spurious name file that has fewer sequences in it than in the X.align file. This is the cause of your error. I would strongly encourage you to follow the Costello example analysis - things like end trimming, chimera checking, etc, make the analysis more robust and so that you don’t need to use hcluster.

Hope this makes sense…
Pat

amcomeau · October 4, 2010, 6:26pm

OK Pat,
A couple of things to follow up on:

I do all the appropriate trimming/chimera removal (that I want to do) after the alignment is generated and I have trimmed out all the gap columns - I wasn’t being exactly literal with every step/name-change I apply to the files, but suffice it to say I have an alignment in good condition, containing all my sequences with useless stuff/columns filtered out, called X.alignT.fas.
Even by removing redundant seqs, I still have 120k and that is why I bothered getting into this column-format matrix mess (which necessitated the names file, hence a run of unique.seqs) to begin with - I thought the column matrix would cut down alot on the phylip matrices that I’d been using before with no probs (I wouldn’t have switched otherwsie)…I also went to hcluster instead of standard cluster for this reason and because it was eating the server’s RAM (a 12k seqs phylip matrix is 600 Mb and runs for 4 hrs on cluster, so the 6 Gb matrix was going to be impossible in the RAM/swap).
So, end of the story appears to be that you cannot create a column-formatted distance matrix without running a unique.seqs command (ony way to generate names file?) - ie: can’t do it with all the data even if you wanted to? My problem is that those identical seqs are not “redundant” to me - I want them (and they must be) in the calculations of evenness and they must be taken into account when the OTUs are associated to their numbers of sequences they represent (as mentioned above in the get.oturep command - ie: the last number [388 seqs] in the name “SequenceX|446|388” of OTU#446). I’m not sure that if you “de-replicate” the data, your diversity measures are then being correctly calculated on the original (pre unique.seqs) numbers (original community evennes/distribution), nor will the above get.oturep command bring back the right numbers of representative seqs in each OTU.

Do you see what I’m getting at - this is important evenness data that can’t be thrown out of a 16S analysis that plans to speak of community structure. An example would be a sample with 1000 different OTUs in it, with every OTU being present once except for one OTU who was blooming 1000-fold greater than the rest at the time of sampling. If you sequence 2000 seqs from this community, you’ll get 999 sequences representing each an OTU and 1001 seqs representing the one dominant OTU. This community has richness=1000 with a very skewed evenness profile. If you run unique.seqs, you end up with 1000 seqs (so you’ve dereplicated by 2-fold) - the richness remains 1000, but you now have 1 seq/OTU (=completely even). If you run summary.single on this reduced dataset, do the diversity calculation take this into account? Will the rarefaction/collector curves remember the 1000 other sequences (if not, the curve will be a 1:1 line)? Will a run of get.outrep say that OTU#1000 represents 1001 seqs or just 1? This community is very different from one where the exact same OTUs are there, but with all represented the same number of times (ie: 2 of each in 2000 seqs)…or with OTU#547 represented 1000-fold more instead of OTU#1000 - 2000 seqs of each of these communities will dereplicate into 1000 seqs…will they run out exactly the same way in the mothur commands?

Hopefully I am simply wrong about the program side of things and that mothur keeps track of the original numbers all the way through the analysis commands, but I get the feeling that it is not so (and it seems to me I tested once specifically with the get.oturep command which is the most important since it allows you to identify the taxonomy of the rank-abundance data)…please enlighten me.

pschloss · October 4, 2010, 6:58pm

Hopefully I am simply wrong about the program side of things and that mothur keeps track of the original numbers all the way through the analysis commands, but I get the feeling that it is not so (and it seems to me I tested once specifically with the get.oturep command which is the most important since it allows you to identify the taxonomy of the rank-abundance data)…please enlighten me.

Right - you are wrong . The reason for the name file is to keep track of the original frequencies. That’s why you have to read it back in with the dist.seqs/hcluster command.

westcott · October 6, 2010, 7:07pm

Hi Janau18,

It seems we have a bug in mothur’s distance calculation for countends=F. Non-overlapping sequences should have a distance of 1, regardless of the countends parameter. What mothur was doing when countends=F was looking for the starting and ending position where the sequences overlapped and then counting the differences in that window. The problem was we initialized the start and end value to the beginning of the sequence, so if the sequences did not overlap we still looked for a mismatch at position 1, didn’t find a mismatch so the distance calculated to be 0.0. Thank you for bringing this bug to our attention, and for taking the time to give files that clearly illustrated the problem. The fix will be part of 1.14.0 releasing later this month.

-Sarah

Topic		Replies	Views
Large dist.seqs producing corrupt files? mothur bugs	11	10593	November 1, 2016
Average Clustering of ~10k unique V6 sequences Commands in mothur	10	8679	May 27, 2011
dist.seqs Commands in mothur	1	14370	February 5, 2010
Problems handling a >50 Gb distance matrix (cluster command) mothur bugs	12	14746	October 18, 2013
Phylip vs. Column-based format changing downstream results Commands in mothur	2	5275	June 17, 2010

dist.seqs bug?

The Mafft alignment is pasted below, which has 1 difference when examined in an alignment viewer (JALView).

mothur > quit

Related topics