I’m trying to use mothur to generate a phylip-formatted distance matrix so I can use some of phylip’s programs to generate a Newick tree for the hypothesis testing approaches (like unifrac and parsimony). I was able to align my sequences using the “align.seqs” command (using the silva database for bacterial genes). I filtered my alignment using “filter.seqs” (using the vertical=T option). I generated a distance matrix using “dist.seqs” with “phylip=T”. Unfortunately, when I try to use this distance matrix in phylip (for the neighbor program), it errors out. The exact error is “end-of-line or end-of-file in the middle of species name for species 1”. I selected the lower-triangular data matrix format.
I was curious about the error, so I opened up my mothur distance matrix file and saw that my first entry had my sample name followed by a tab but no data. I manually added a random number as a test and reran the neighbor program. This time it worked without error. Am I using the commands in mothur wrong or is there an issue with how mothur is formatting the matrix? Has anyone else run into this problem and found a solution? Is there another program you can recommend to generate a Newick tree?
My system is a PC running Windows XP (unfortunately I don’t have access to a Linux system so I can’t use programs like ARB and Clearcut). I have 2 GB of memory if that makes a difference. Any help would be appreciated. Thanks!
I think the answer is simple. It sounds like mothur is creating the lower triangular distance matrix as you asked for, except it does not provide the sequence vs. self distance. Example:
C 3 2
B 2 0
C 3 2 0
If this is not it, then you should make sure that your sequence names are 10 characters long. If they are more than 10, you’re screwed. If they are fewer than 10, make sure you fill the rest of the characters with spaces. Phylip is an old and picky program.
For your tree question, I thought Newick was a format and not a type? I’m not sure about this, but you should try the program Geneious, it has a ton of different tree options.
Hope this helps.
The phylip formatted matrix should be fine. You just have to tell phylip that the matrix is in the lower triangle format. The zeros on the diagonal will cause problems.
I did select the “lower triangle format” in neighbor. It still errored out with the same message. However, I did a little playing with it and I think I figured it out. I ran a small test alignment through mothur and generated a phylip-formatted matrix. I ran the same alignment through phylip’s dnadist program and specified a lower triangle output. When I compared the two in a text editor, I noticed a subtle difference that I think is the root of the problem.
The distance matrix from dnadist has the sample name followed by two spaces. The matrix from mothur has the sample name followed by a tab. When I modified the matrix from mothur and replaced the tab with two spaces, phylip recognized it without the error message. It’s interesting to note that phylip didn’t have problems with subsequent entries in the matrix (dnadist has the sample name followed by 3 spaces and then the matrix data, whereas mothur retains the sample name-tab-data format).
I only tested this on a small subset, but I’ll post whether it works with a larger sample. Hopefully this fix will help anyone who runs into the same problem, but it’s also something to keep in mind if people use the phylip-formatted distance matrix with other programs.
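For anyone who would rather script the tab-to-spaces edit than do it by hand, here is a minimal Python sketch. It is not part of mothur or phylip. It assumes the matrix layout described in this thread (a first line holding the number of sequences, then one row per sample written as name, tab, tab-separated distances) and that phylip wants the name in a fixed field of 10 characters. The function names pad_name_field and fix_matrix are made up for illustration.

```python
def pad_name_field(line, width=10):
    """Turn one 'name<TAB>distances' row into a row with a fixed-width name."""
    name, _, rest = line.rstrip("\r\n").partition("\t")
    if len(name) > width:
        # Assumption from the thread: names longer than the field cannot be used.
        raise ValueError(f"name {name!r} is longer than {width} characters")
    padded = name.ljust(width)           # fill the name field with spaces
    distances = " ".join(rest.split())   # keep the distance strings untouched
    return padded + " " + distances if distances else padded


def fix_matrix(text, width=10):
    """Pad every row of a small matrix held in memory; the count line is kept as is."""
    lines = text.splitlines()
    if not lines:
        return text
    rows = [pad_name_field(row, width) for row in lines[1:] if row.strip()]
    return "\n".join([lines[0]] + rows) + "\n"


# Small made-up lower-triangle matrix with 8-character names.
example = "3\nseq00001\t\nseq00002\t0.1200\nseq00003\t0.3400\t0.2500\n"
print(fix_matrix(example), end="")
```

With 8-character names this should give two spaces after the name on the first row and three spaces before the data on later rows, which is the spacing I saw in the dnadist output. The distances are copied as text, so no values are changed.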
Rewski52, thanks for your advice. I’ll take a look at Geneious in case phylip’s neighbor gives me problems. Yeah, newick is a format. That’s what I meant in my post.
PS: I agree with the other posters. Pat and his team should be given props for generating and compiling a great suite of software. It’s a powerful set of programs! Good work!
Thanks for the praise - be sure to tell your PIs so that when they review our proposals they can praise us too!
Also, thanks for the heads up on the bug report. I had forgotten about this “feature” in phylip. It actually limits one to using 10 character names. I suspect your sequence names were either 8 or 9 letters long if you only had one or two spaces. We’ll get this fixed…
Thanks for the information. I wanted to give an update. As written, the phylip function will only work with sample names that are 9 or 10 characters. My sample names were 8 characters and it gave the “end-of-line” error. However, if your sample names are longer than 10 characters, phylip will give an “invalid file type” error.
In the meantime, for anyone using mothur to generate phylip distance matrices for use with phylip tree programs, just make sure your sample names are 9 or 10 characters in length and it should work fine. I had a large distance matrix (5.7 GB or so) that I couldn’t open in a text editor to make the changes to the first entry, so I’m rerunning my matrix with new sample names that are the correct size. I hope this works, but I don’t know if phylip’s neighbor program can handle a 5.7 GB file.
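If the matrix is too big for a text editor, the name field can also be rewritten one line at a time so the file never has to fit in memory. The Python sketch below is only an illustration under assumptions: the first line is the sequence count, every other row is name, tab, tab-separated distances, and the 10-character name field is the limit reported in this thread. The name rewrite_names is hypothetical, and I have not tried this on a file anywhere near 5.7 GB.

```python
import io


def rewrite_names(src, dst, width=10):
    """Copy a matrix from src to dst line by line, padding names to `width` columns.

    Returns the number of data rows written.
    """
    dst.write(src.readline())  # count line, copied unchanged
    rows = 0
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        name, _, rest = line.partition("\t")
        if len(name) > width:
            raise ValueError(f"name {name!r} is longer than {width} characters")
        # Fixed-width name, one extra space, then the distances separated by spaces.
        dst.write(name.ljust(width) + " " + rest.replace("\t", " ") + "\n")
        rows += 1
    return rows


# Demo on an in-memory matrix; for a real file, pass handles from open(...) instead.
demo_in = io.StringIO("3\nsamp0001\t\nsamp0002\t0.1200\nsamp0003\t0.3400\t0.2500\n")
demo_out = io.StringIO()
rows = rewrite_names(demo_in, demo_out)
print(rows, "rows rewritten")
print(demo_out.getvalue(), end="")
```

For an actual run, open the input for reading and a new output file for writing and pass both handles to rewrite_names. Writing to a new file rather than editing in place keeps the original matrix intact if something goes wrong partway through.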
Anyway, hope this helps and hope the info allows you to change mothur’s format accordingly.
Thanks for the update. By the way, this bug only happens in the PHYLIP programs. I know that our matrices work with other programs that use phylip-formatted matrices (e.g. clearcut). We’re on it.