dist.seqs evolution model


I have searched extensively through the wiki, the original mothur publication in AEM, and Pat’s latest PLoS CB manuscript, and I can’t determine which model of nucleotide substitution the dist.seqs command uses. All of the literature says that the algorithm is identical to DNADIST from the PHYLIP package (with added flexibility in gap penalties), but DNADIST has several options, one of which is the distance calculation model. The default model in PHYLIP is F84 (which, according to the DNADIST documentation, incorporates some degree of maximum likelihood?), but my gut tells me that dist.seqs uses the Jukes-Cantor nucleotide substitution model for simplicity. Some clarification would be much appreciated. Also, if there are other aspects of DNADIST that match or differ from the default PHYLIP implementation when implemented in mothur, it would be useful to note these in the documentation as well.

One final note: the wiki points the reader to the methods employed by Sogin et al. 1995, but I could not find any papers from 1995 with Sogin M as first author via Google Scholar. Perhaps this is a book chapter? Is there a list of references associated with the wiki where I might find this reference?

Thanks as always

In the PLoS CB paper I point out that most people use a model out of phylogenetic guilt. dist.seqs does not use a correction for multiple substitutions, since none of the correction models allow for gaps. PHYLIP treats a gap as missing data, not as an insertion/deletion. The Sogin paper should be his first one using pyrosequencing, which was published in PNAS. Perhaps it’s 2006?
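To make the distinction concrete, here is a minimal Python sketch of an uncorrected pairwise distance next to the Jukes-Cantor correction. It is an illustration only, not mothur's source code: the function names, the `count_gaps` switch, and the choice to skip columns that are gaps in both sequences are assumptions made for this example. With `count_gaps=True` a base-versus-gap column is scored as a difference; with `count_gaps=False` it is skipped, which mimics the missing-data treatment described above.

```python
import math

GAP_CHARS = {"-", "."}  # assumed gap symbols for this illustration


def uncorrected_distance(seq_a, seq_b, count_gaps=True):
    """Fraction of compared alignment columns at which two sequences differ.

    Hypothetical helper, not mothur's implementation. No correction for
    multiple substitutions is applied.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        a_gap = a in GAP_CHARS
        b_gap = b in GAP_CHARS
        if a_gap and b_gap:
            continue  # gap in both sequences: nothing to compare
        if a_gap or b_gap:
            if not count_gaps:
                continue  # missing-data treatment: skip the column
            compared += 1
            differences += 1  # gap scored as a difference
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed difference fraction p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq_a = "ACGTACGT-A"
seq_b = "ACGTTCGTAA"
p_with_gaps = uncorrected_distance(seq_a, seq_b, count_gaps=True)
p_without_gaps = uncorrected_distance(seq_a, seq_b, count_gaps=False)
print(p_with_gaps, p_without_gaps, jukes_cantor(p_with_gaps))
```

The Jukes-Cantor value is always at least as large as the uncorrected fraction it is computed from, which is the "correction for multiple substitutions" that an uncorrected distance leaves out.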


I have a query about dist.seqs and amino acid sequences. Operating in ignorance, I got a distance file by processing an amino acid alignment in fasta format using the dist.seqs command. I then read posts in the forum (e.g. “Using Mothur for protein sequences?!”) and the wiki, which implied that protein distance matrices should be calculated outside mothur. Is this a real distance matrix, and if so, is this done using PROTDIST?


Yeah, dist.seqs should not be used for amino acid sequences. It will treat anything other than an A, T, G, C, or U as an N.
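A rough Python sketch of what that masking does to a protein sequence is below. It is a hypothetical illustration, not mothur's code: the function name is made up, and leaving gap characters (`-` and `.`) untouched is an assumption. The letters A, C, G, and T are also valid amino acid codes, so a few residues survive and get compared as if they were bases, while everything else collapses to N.

```python
NUCLEOTIDES = set("ACGTU")
GAP_CHARS = set("-.")  # assumption: gap symbols are left as they are


def mask_non_nucleotides(seq):
    """Replace every character that is not A, C, G, T, U, or a gap with N."""
    return "".join(
        ch if ch in NUCLEOTIDES or ch in GAP_CHARS else "N"
        for ch in seq.upper()
    )


protein = "MKTAYIAKQR"  # made-up peptide fragment for illustration
masked = mask_non_nucleotides(protein)
print(protein, "->", masked)
```

Most of the information in the alignment is thrown away before any comparison happens, so the resulting numbers are not meaningful protein distances.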

I have to agree with the last post that it is not obvious how distance is calculated. Maybe I overlooked it, but I looked around for a formula defining distance and couldn’t find it. Perhaps it would be feasible to include a link in the first sentence of the Dist.seqs page pointing to a definition.