pairwise.seqs

jeffpowell · March 1, 2011, 10:41am

Hi all. The pairwise.seqs command appears to estimating distances incorrectly. I noticed this while calculating pairwise distances for some fungal data and replicated the result with a subset of the esophagus dataset. The distances are, on average, 23 % larger than those calculated after aligning against core_set_aligned.imputed.fasta and 29 % larger than using ESPRIT (using the same gapopen [10] and gapextend [-0.5] parameters as the ESPRIT default), with a large range (0 - 73 % larger).

Here’s the mothur screen output. I was running the 32-bit version on my MacBook Pro. I can supply additional files if they are needed.

Best regards,
Jeff

mothur v.1.17.0 Last updated: 2/25/2011

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type ‘help()’ for information on the commands that are available

Type ‘quit()’ to exit program

mothur > sub.sample(fasta=esophagus.fasta)

Sampling 71 from 710.

Output File Names:
esophagus.subsample.fasta

mothur > unique.seqs(fasta=esophagus.subsample.fasta)
Output File Names: esophagus.subsample.unique.fasta esophagus.subsample.names
mothur > align.seqs(candidate=esophagus.subsample.unique.fasta, template=core_set_aligned.imputed.fasta)
Reading in the core_set_aligned.imputed.fasta template sequences... DONE. Aligning sequences from esophagus.subsample.unique.fasta ... 70 It took 3 secs to align 70 sequences.
Output File Names: esophagus.subsample.unique.align esophagus.subsample.unique.align.report
mothur > dist.seqs(fasta=esophagus.subsample.unique.align)

0 0
69 0

Output File Name:
esophagus.subsample.unique.dist

It took 0 to calculate the distances for 70 sequences.

mothur > system(cp esophagus.subsample.unique.dist esophagus.align.dist)

mothur > pairwise.seqs(fasta=esophagus.subsample.unique.fasta)

Processing sequences from esophagus.subsample.unique.fasta …
0 0
69 39
It took 39 to calculate the distances for 70 sequences.

Output File Name:
esophagus.subsample.unique.dist

mothur > system(cp esophagus.subsample.unique.dist esophagus.pairwise.dist)
mothur > read.dist(column=esophagus.align.dist, name=esophagus.subsample.names)

********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||

It took 0 secs to read

mothur > cluster()

unique 2 69 1
0.00 9 50 3 2 0 0 0 0 0 1
0.01 16 23 7 1 1 1 1 0 0 0 0 0 0 0 0 0 1
0.02 17 15 9 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1
0.03 19 14 6 2 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
0.04 20 11 4 4 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.06 20 11 3 3 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.07 21 10 3 3 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.08 24 10 3 2 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.09 24 10 2 2 1 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.12 33 10 2 2 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.13 33 8 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.18 33 6 2 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.21 33 5 2 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.24 35 4 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.25 35 1 2 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.27 35 1 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
0.29 54 1 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.30 57 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.32 59 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.33 62 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.37 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Output File Names:
esophagus.align.fn.sabund
esophagus.align.fn.rabund
esophagus.align.fn.list

It took 0 seconds to cluster

mothur > read.dist(column=esophagus.pairwise.dist, name=esophagus.subsample.names)

********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||

It took 0 secs to read

mothur > cluster()

unique 2 67 2
0.00 10 46 4 1 1 0 0 0 0 0 1
0.01 17 29 6 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.02 17 23 9 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.03 17 20 9 2 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.04 18 19 8 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.05 18 18 7 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.07 18 18 7 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1
0.10 18 18 6 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.11 18 16 4 3 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.13 18 14 5 3 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.15 18 13 5 3 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.18 18 11 6 3 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.25 18 7 8 3 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.30 18 4 8 4 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0.35 20 4 7 4 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.41 20 3 6 5 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.43 20 3 4 5 2 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.44 20 2 3 6 2 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.47 20 2 2 6 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.48 20 1 2 3 2 1 2 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.72 20 0 0 3 3 0 3 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0.73 24 0 0 2 2 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
0.74 24 0 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
0.75 24 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1
0.76 31 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
0.77 31 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
0.78 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Output File Names:
esophagus.pairwise.fn.sabund
esophagus.pairwise.fn.rabund
esophagus.pairwise.fn.list

It took 0 seconds to cluster

mothur > pairwise.seqs(fasta=esophagus.subsample.unique.fasta, align=needleman, gapopen=-10, gapextend=-0.5)

Processing sequences from esophagus.subsample.unique.fasta …
0 0
69 36
It took 36 to calculate the distances for 70 sequences.

Output File Name:
esophagus.subsample.unique.dist

mothur > system(cp esophagus.subsample.unique.dist esophagus.pwGapParam.dist)
mothur > read.dist(column=esophagus.pwGapParam.dist, name=esophagus.subsample.names)

********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||

It took 0 secs to read

mothur > cluster()

unique 2 67 2
0.00 10 46 4 1 1 0 0 0 0 0 1
0.01 17 29 6 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.02 17 21 10 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.03 17 18 10 2 1 0 1 0 0 0 0 0 0 0 0 0 0 1
0.04 18 17 9 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.05 18 16 8 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1
0.07 18 16 7 2 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1
0.10 18 16 7 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.12 18 15 6 2 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.13 18 13 4 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.18 18 11 5 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.23 18 9 6 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.26 18 7 7 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.27 18 5 8 4 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.28 18 5 6 4 2 1 0 0 0 0 0 1 0 0 0 0 0 0 1
0.32 20 5 5 4 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.35 20 4 4 5 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.38 20 2 5 5 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.39 20 2 3 5 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.41 20 1 3 5 2 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
0.46 23 1 3 4 2 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.51 23 0 1 4 2 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.61 23 0 1 4 1 2 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.68 23 0 1 3 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.70 23 0 1 2 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1
0.71 23 0 0 2 0 1 0 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
0.72 23 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1
0.75 34 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
0.76 53 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.77 58 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0.78 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Output File Names:
esophagus.pwGapParam.fn.sabund
esophagus.pwGapParam.fn.rabund
esophagus.pwGapParam.fn.list

It took 0 seconds to cluster

mothur > quit()

pschloss · March 1, 2011, 1:55pm

Thanks, Jeff. Could you send us a couple of the sequences that are causing this issue (mothur.bugs - gmail)?

pschloss · March 4, 2011, 1:51pm

Hi Jeff,

So I think there’s a bit of apples to oranges going on here. First off, by default pairwise.seqs runs the needleman algorithm, which doesn’t use the gap extension value and by default counts differences at the ends between bases and gaps. I’m pretty sure that ESPRIT uses gotoh, which does use gap extension values (also available within mothur) and ignores gaps at the ends (also available within mothur). Here’s some things I looked at to address your query with the data you sent me…

mothur "#align.seqs(candidate=esophagus.subsample.unique.fasta, template=core_set_aligned.imputed.fasta);dist.seqs(fasta=esophagus.subsample.unique.align)" mv esophagus.subsample.unique.dist gg.dist

mothur “#pairwise.seqs(fasta=esophagus.subsample.unique.fasta)”
mv esophagus.subsample.unique.dist pw.default.dist

The R2 between pw.default.dist and gg.dist was: 0.9909458
The slope between pw.default.dist(y) and gg.dist(x) was: 0.8821

mothur "#pairwise.seqs(fasta=esophagus.subsample.unique.fasta, gapopen=-10)" mv esophagus.subsample.unique.dist pw.need_10.dist

The R2 between pw.need_10.dist and gg.dist was: 0.926943
The slope between pw.need_10.dist(y) and gg.dist(x) was: 1.057

mothur "#pairwise.seqs(fasta=esophagus.subsample.unique.fasta, align=gotoh)" mv esophagus.subsample.unique.dist pw.gotoh.dist

The R2 between pw.gotoh.dist and gg.dist was: 0.9905227
The slope between pw.gotoh.dist(y) and gg.dist(x) was: 0.9149

mothur "#pairwise.seqs(fasta=esophagus.subsample.unique.fasta, align=gotoh, gapopen=-10, gapextend=-0.5)" mv esophagus.subsample.unique.dist pw.gotoh_10_05.dist

The R2 between pw.gotoh_10_05.dist and gg.dist was: 0.9891668
The slope between pw.gotoh_10_05.dist(y) and gg.dist(x) was: 0.977

The R2 between pw.gotoh_10_05.dist and pw.default.dist was: 0.9970326 The slope between pw.gotoh_10_05.dist(y) and pw.default.dist(x) was: 1.107

The R2 between pw.gotoh_10_05.dist and pw.need_10.dist was: 0.9466522
The slope between pw.gotoh_10_05.dist(y) and pw.need_10.dist(x) was: 0.9015

I *think* these last two comparisons describe what you are seeing. Incidentally, I'd add that the gap/open penalties of 10/0.5 seem to do a poorer job of replicating greengenes than the default values of 2.0/1.0 as judged from R2 values. Also, this does confirm what I wrote in an earlier PLoS Comp Bio paper that pairwise distances predict smaller distances than those derived from alignments to greengenes or the better alignment - SILVA.

Let me know if you still think there are problems or have further questions…
Pat

jeffpowell · March 4, 2011, 3:37pm

Hi Pat. Thanks for looking into this. I’m not sure what I’m doing wrong but I’m not able to replicate your results. Just looking at the first comparison I ran the same code on the file I sent earlier and got something vastly different from you.

/Applications/mothur “#align.seqs(candidate=esophagus.subsample.unique.fasta, template=core_set_aligned.imputed.fasta);dist.seqs(fasta=esophagus.subsample.unique.align)”
mv esophagus.subsample.unique.dist gg.dist

/Applications/mothur “#pairwise.seqs(fasta=esophagus.subsample.unique.fasta)”
mv esophagus.subsample.unique.dist pw.default.dist

This results in an R2 of 0.32 and slope (pw.default by gg) of 1.33 (shown below). This was using v.1.17.0 on my mac and I got the exact same result using v.1.15.0 on a PC. Any ideas?

From R:

gg<-read.table(‘gg.dist’,stringsAsFactors=F)
pw<-read.table(‘pw.default.dist’,stringsAsFactors=F)

str(gg)
‘data.frame’: 2415 obs. of 3 variables:
$ V1: chr “9_3_14” “65_7_25” “65_7_25” “65_9_4” …
$ V2: chr “9_3_22” “9_3_22” “9_3_14” “9_3_22” …
$ V3: num 0.0981 0.1015 0.0439 0.1004 0.0427 …
str(pw)
‘data.frame’: 2415 obs. of 3 variables:
$ V1: chr “9_3_14” “65_7_25” “65_7_25” “65_9_4” …
$ V2: chr “9_3_22” “9_3_22” “9_3_14” “9_3_22” …
$ V3: num 0.0518 0.0593 0.0404 0.0581 0.0392 …

summary(lm(pw[,3]~gg[,3]))

Call:
lm(formula = pw[, 3] ~ gg[, 3])

Residuals:
Min 1Q Median 3Q Max
-0.33812 -0.19158 -0.03126 0.17354 0.55314

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18802 0.01024 18.36 <2e-16 *
gg[, 3] 1.33408 0.03991 33.42 <2e-16 *

Signif. codes: 0 â€˜â€™ 0.001 â€˜â€™ 0.01 â€˜â€™ 0.05 â€˜.â€™ 0.1 â€˜ â€™ 1

Residual standard error: 0.2104 on 2413 degrees of freedom
Multiple R-squared: 0.3165, Adjusted R-squared: 0.3162
F-statistic: 1117 on 1 and 2413 DF, p-value: < 2.2e-16

summary(pw[,3]-gg[,3])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04813 0.05570 0.25610 0.26590 0.45200 0.74700

any(pw[,1]!=gg[,1])
[1] FALSE
any(pw[,2]!=gg[,2])
[1] FALSE

Topic		Replies	Views
dist.seqs bug? mothur bugs	18	15921	October 6, 2010
Average Clustering of ~10k unique V6 sequences Commands in mothur	10	8679	May 27, 2011
Pairwise.seqs Feature requests granted	0	4355	October 26, 2010
pairwise.seqs vs. Esprit Commands in mothur	1	3025	August 17, 2011
Pairwise aligner for OTUs Feature requests	11	14276	May 20, 2011

pairwise.seqs

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.18802 0.01024 18.36 <2e-16 *** gg[, 3] 1.33408 0.03991 33.42 <2e-16 ***

Related topics

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18802 0.01024 18.36 <2e-16 *
gg[, 3] 1.33408 0.03991 33.42 <2e-16 *