pairwise.seqs

Hi all. The pairwise.seqs command appears to estimating distances incorrectly. I noticed this while calculating pairwise distances for some fungal data and replicated the result with a subset of the esophagus dataset. The distances are, on average, 23 % larger than those calculated after aligning against core_set_aligned.imputed.fasta and 29 % larger than using ESPRIT (using the same gapopen [10] and gapextend [-0.5] parameters as the ESPRIT default), with a large range (0 - 73 % larger).

Here’s the mothur screen output. I was running the 32-bit version on my MacBook Pro. I can supply additional files if they are needed.

Best regards,
Jeff


mothur v.1.17.0 Last updated: 2/25/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type ‘help()’ for information on the commands that are available

Type ‘quit()’ to exit program



mothur > sub.sample(fasta=esophagus.fasta)

Sampling 71 from 710.

Output File Names:
esophagus.subsample.fasta


mothur > unique.seqs(fasta=esophagus.subsample.fasta)
Output File Names: esophagus.subsample.unique.fasta esophagus.subsample.names
mothur > align.seqs(candidate=esophagus.subsample.unique.fasta, template=core_set_aligned.imputed.fasta)
Reading in the core_set_aligned.imputed.fasta template sequences... DONE. Aligning sequences from esophagus.subsample.unique.fasta ... 70 It took 3 secs to align 70 sequences.
Output File Names: esophagus.subsample.unique.align esophagus.subsample.unique.align.report
mothur > dist.seqs(fasta=esophagus.subsample.unique.align)

0 0
69 0

Output File Name:
esophagus.subsample.unique.dist

It took 0 to calculate the distances for 70 sequences.

mothur > system(cp esophagus.subsample.unique.dist esophagus.align.dist)


mothur > pairwise.seqs(fasta=esophagus.subsample.unique.fasta)

Processing sequences from esophagus.subsample.unique.fasta …
0 0
69 39
It took 39 to calculate the distances for 70 sequences.

Output File Name:
esophagus.subsample.unique.dist


mothur > system(cp esophagus.subsample.unique.dist esophagus.pairwise.dist)
mothur > read.dist(column=esophagus.align.dist, name=esophagus.subsample.names)

********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||


It took 0 secs to read

mothur > cluster()

unique 2 69 1
0.00 9 50 3 2 0 0 0 0 0 1
0.01 16 23 7 1 1 1 1 0 0 0 0 0 0 0 0 0 1
0.02 17 15 9 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1
0.03 19 14 6 2 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
0.04 20 11 4 4 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.06 20 11 3 3 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.07 21 10 3 3 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.08 24 10 3 2 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.09 24 10 2 2 1 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.12 33 10 2 2 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.13 33 8 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.18 33 6 2 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.21 33 5 2 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.24 35 4 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.25 35 1 2 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.27 35 1 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
0.29 54 1 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.30 57 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.32 59 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.33 62 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.37 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Output File Names:
esophagus.align.fn.sabund
esophagus.align.fn.rabund
esophagus.align.fn.list

It took 0 seconds to cluster

mothur > read.dist(column=esophagus.pairwise.dist, name=esophagus.subsample.names)

********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||


It took 0 secs to read

mothur > cluster()

unique 2 67 2
0.00 10 46 4 1 1 0 0 0 0 0 1
0.01 17 29 6 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.02 17 23 9 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.03 17 20 9 2 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.04 18 19 8 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.05 18 18 7 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.07 18 18 7 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1
0.10 18 18 6 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.11 18 16 4 3 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.13 18 14 5 3 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.15 18 13 5 3 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.18 18 11 6 3 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.25 18 7 8 3 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.30 18 4 8 4 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.35 20 4 7 4 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.41 20 3 6 5 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.43 20 3 4 5 2 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.44 20 2 3 6 2 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.47 20 2 2 6 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.48 20 1 2 3 2 1 2 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.72 20 0 0 3 3 0 3 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.73 24 0 0 2 2 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
0.74 24 0 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
0.75 24 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1
0.76 31 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
0.77 31 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
0.78 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Output File Names:
esophagus.pairwise.fn.sabund
esophagus.pairwise.fn.rabund
esophagus.pairwise.fn.list

It took 0 seconds to cluster

mothur > pairwise.seqs(fasta=esophagus.subsample.unique.fasta, align=needleman, gapopen=-10, gapextend=-0.5)

Processing sequences from esophagus.subsample.unique.fasta …
0 0
69 36
It took 36 to calculate the distances for 70 sequences.

Output File Name:
esophagus.subsample.unique.dist


mothur > system(cp esophagus.subsample.unique.dist esophagus.pwGapParam.dist)
mothur > read.dist(column=esophagus.pwGapParam.dist, name=esophagus.subsample.names)

********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||


It took 0 secs to read

mothur > cluster()

unique 2 67 2
0.00 10 46 4 1 1 0 0 0 0 0 1
0.01 17 29 6 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.02 17 21 10 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.03 17 18 10 2 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.04 18 17 9 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.05 18 16 8 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.07 18 16 7 2 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1
0.10 18 16 7 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.12 18 15 6 2 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.13 18 13 4 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.18 18 11 5 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.23 18 9 6 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.26 18 7 7 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.27 18 5 8 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.28 18 5 6 4 2 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.32 20 5 5 4 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.35 20 4 4 5 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.38 20 2 5 5 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.39 20 2 3 5 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.41 20 1 3 5 2 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.46 23 1 3 4 2 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.51 23 0 1 4 2 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.61 23 0 1 4 1 2 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.68 23 0 1 3 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.70 23 0 1 2 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.71 23 0 0 2 0 1 0 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
0.72 23 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1
0.75 34 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
0.76 53 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.77 58 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.78 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Output File Names:
esophagus.pwGapParam.fn.sabund
esophagus.pwGapParam.fn.rabund
esophagus.pwGapParam.fn.list

It took 0 seconds to cluster

mothur > quit()

Thanks, Jeff. Could you send us a couple of the sequences that are causing this issue (mothur.bugs - gmail)?

Hi Jeff,

So I think there’s a bit of apples to oranges going on here. First off, by default pairwise.seqs runs the needleman algorithm, which doesn’t use the gap extension value and by default counts differences at the ends between bases and gaps. I’m pretty sure that ESPRIT uses gotoh, which does use gap extension values (also available within mothur) and ignores gaps at the ends (also available within mothur). Here’s some things I looked at to address your query with the data you sent me…


mothur "#align.seqs(candidate=esophagus.subsample.unique.fasta, template=core_set_aligned.imputed.fasta);dist.seqs(fasta=esophagus.subsample.unique.align)" mv esophagus.subsample.unique.dist gg.dist

mothur “#pairwise.seqs(fasta=esophagus.subsample.unique.fasta)”
mv esophagus.subsample.unique.dist pw.default.dist

The R2 between pw.default.dist and gg.dist was: 0.9909458
The slope between pw.default.dist(y) and gg.dist(x) was: 0.8821


mothur "#pairwise.seqs(fasta=esophagus.subsample.unique.fasta, gapopen=-10)" mv esophagus.subsample.unique.dist pw.need_10.dist

The R2 between pw.need_10.dist and gg.dist was: 0.926943
The slope between pw.need_10.dist(y) and gg.dist(x) was: 1.057


mothur "#pairwise.seqs(fasta=esophagus.subsample.unique.fasta, align=gotoh)" mv esophagus.subsample.unique.dist pw.gotoh.dist

The R2 between pw.gotoh.dist and gg.dist was: 0.9905227
The slope between pw.gotoh.dist(y) and gg.dist(x) was: 0.9149



mothur "#pairwise.seqs(fasta=esophagus.subsample.unique.fasta, align=gotoh, gapopen=-10, gapextend=-0.5)" mv esophagus.subsample.unique.dist pw.gotoh_10_05.dist

The R2 between pw.gotoh_10_05.dist and gg.dist was: 0.9891668
The slope between pw.gotoh_10_05.dist(y) and gg.dist(x) was: 0.977


The R2 between pw.gotoh_10_05.dist and pw.default.dist was: 0.9970326 The slope between pw.gotoh_10_05.dist(y) and pw.default.dist(x) was: 1.107

The R2 between pw.gotoh_10_05.dist and pw.need_10.dist was: 0.9466522
The slope between pw.gotoh_10_05.dist(y) and pw.need_10.dist(x) was: 0.9015


I *think* these last two comparisons describe what you are seeing. Incidentally, I'd add that the gap/open penalties of 10/0.5 seem to do a poorer job of replicating greengenes than the default values of 2.0/1.0 as judged from R2 values. Also, this does confirm what I wrote in an earlier PLoS Comp Bio paper that pairwise distances predict smaller distances than those derived from alignments to greengenes or the better alignment - SILVA.

Let me know if you still think there are problems or have further questions…
Pat

Hi Pat. Thanks for looking into this. I’m not sure what I’m doing wrong but I’m not able to replicate your results. Just looking at the first comparison I ran the same code on the file I sent earlier and got something vastly different from you.

/Applications/mothur “#align.seqs(candidate=esophagus.subsample.unique.fasta, template=core_set_aligned.imputed.fasta);dist.seqs(fasta=esophagus.subsample.unique.align)”
mv esophagus.subsample.unique.dist gg.dist

/Applications/mothur “#pairwise.seqs(fasta=esophagus.subsample.unique.fasta)”
mv esophagus.subsample.unique.dist pw.default.dist


This results in an R2 of 0.32 and slope (pw.default by gg) of 1.33 (shown below). This was using v.1.17.0 on my mac and I got the exact same result using v.1.15.0 on a PC. Any ideas?

From R:

gg<-read.table(‘gg.dist’,stringsAsFactors=F)
pw<-read.table(‘pw.default.dist’,stringsAsFactors=F)

str(gg)
‘data.frame’: 2415 obs. of 3 variables:
$ V1: chr “9_3_14” “65_7_25” “65_7_25” “65_9_4” …
$ V2: chr “9_3_22” “9_3_22” “9_3_14” “9_3_22” …
$ V3: num 0.0981 0.1015 0.0439 0.1004 0.0427 …
str(pw)
‘data.frame’: 2415 obs. of 3 variables:
$ V1: chr “9_3_14” “65_7_25” “65_7_25” “65_9_4” …
$ V2: chr “9_3_22” “9_3_22” “9_3_14” “9_3_22” …
$ V3: num 0.0518 0.0593 0.0404 0.0581 0.0392 …

summary(lm(pw[,3]~gg[,3]))

Call:
lm(formula = pw[, 3] ~ gg[, 3])

Residuals:
Min 1Q Median 3Q Max
-0.33812 -0.19158 -0.03126 0.17354 0.55314

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18802 0.01024 18.36 <2e-16 ***
gg[, 3] 1.33408 0.03991 33.42 <2e-16 ***

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2104 on 2413 degrees of freedom
Multiple R-squared: 0.3165, Adjusted R-squared: 0.3162
F-statistic: 1117 on 1 and 2413 DF, p-value: < 2.2e-16

summary(pw[,3]-gg[,3])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04813 0.05570 0.25610 0.26590 0.45200 0.74700

any(pw[,1]!=gg[,1])
[1] FALSE
any(pw[,2]!=gg[,2])
[1] FALSE