Problem with silva.bacteria.fasta file for alignment

Hi,

I’ve noticed a problem that occurs with the silva.bacteria.fasta file used for doing alignments. For my analysis I used the pcr.seqs command to specifically select for the V6 region of the gene using my sequencing primers on the silva.bacteria.fasta file then used that file for the alignment of my sequences. After finishing my analysis I found that there were far more OTUs in some of my samples than was possible. I made my quality checks and sequencing as stringent as possible but still observed this phenomenon. Finally I looked the actual alignment of the silva.bacteria.fasta file and noticed that the alignment had some very obvious errors.

Using the RDP Infernal alignment completely eliminated the problem.

Here’s the readout from mothur on these two files.

mothur > summary.seqs(fasta=silva.bacteria.fasta, processors=2)

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 43061 1252 0 4 1
2.5%-tile: 1044 43116 1399 0 5 374
25%-tile: 1044 43116 1438 0 5 3740
Median: 1044 43116 1454 0 5 7479
75%-tile: 1044 43116 1465 0 6 11218
97.5%-tile: 1044 43116 1490 3 7 14583
Maximum: 1044 43116 1712 5 9 14956
Mean: 1044 43116 1449.59 0.362931 5.49104

of Seqs: 14956

Output File Names:
silva.bacteria.summary


mothur > pcr.seqs(fasta=silva.bacteria.fasta, oligos=primersr.txt, keepprimer=true, keepdots=false)

Using 2 processors.
Processing sequence: 100
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 300
Processing sequence: 300
Processing sequence: 400
Processing sequence: 400
Processing sequence: 500
Processing sequence: 500
Processing sequence: 600
Processing sequence: 600
Processing sequence: 700
Processing sequence: 700
Processing sequence: 800
Processing sequence: 800
Processing sequence: 900
Processing sequence: 900
Processing sequence: 1000Processing sequence: 1000

Processing sequence: 1100
Processing sequence: 1100
Processing sequence: 1200
Processing sequence: 1200
Processing sequence: 1300
Processing sequence: 1300
Processing sequence: 1400
Processing sequence: 1400
Processing sequence: 1500
Processing sequence: 1500
Processing sequence: 1600
Processing sequence: 1600
Processing sequence: 1700
Processing sequence: 1700
Processing sequence: 1800
Processing sequence: 1800
Processing sequence: 1900
Processing sequence: 1900
Processing sequence: 2000
Processing sequence: 2000
Processing sequence: 2100
Processing sequence: 2100
Processing sequence: 2200
Processing sequence: 2200
Processing sequence: 2300
Processing sequence: 2300
Processing sequence: 2400
Processing sequence: 2400
Processing sequence: 2500
Processing sequence: 2500
Processing sequence: 2600
Processing sequence: 2600
Processing sequence: 2700
Processing sequence: 2700
Processing sequence: 2800
Processing sequence: 2800
Processing sequence: 2900
Processing sequence: 2900
Processing sequence: 3000
Processing sequence: 3000
Processing sequence: 3100
Processing sequence: 3100
Processing sequence: 3200
Processing sequence: 3200
Processing sequence: 3300
Processing sequence: 3300
Processing sequence: 3400
Processing sequence: 3400
Processing sequence: 3500
Processing sequence: 3500
Processing sequence: 3600
Processing sequence: 3600
Processing sequence: 3700
Processing sequence: 3700
Processing sequence: 3800
Processing sequence: 3800
Processing sequence: 3900
Processing sequence: 3900
Processing sequence: 4000
Processing sequence: 4000
Processing sequence: 4100
Processing sequence: 4100
Processing sequence: 4200
Processing sequence: 4200
Processing sequence: 4300
Processing sequence: 4300
Processing sequence: 4400
Processing sequence: 4400
Processing sequence: 4500
Processing sequence: 4500
Processing sequence: 4600
Processing sequence: 4600
Processing sequence: 4700
Processing sequence: 4700
Processing sequence: 4800
Processing sequence: 4800
Processing sequence: 4900
Processing sequence: 4900
Processing sequence: 5000
Processing sequence: 5000
Processing sequence: 5100
Processing sequence: 5100
Processing sequence: 5200
Processing sequence: 5200
Processing sequence: 5300
Processing sequence: 5300
Processing sequence: 5400
Processing sequence: 5400
Processing sequence: 5500
Processing sequence: 5500
Processing sequence: 5600
Processing sequence: 5600
Processing sequence: 5700
Processing sequence: 5700
Processing sequence: 5800
Processing sequence: 5800
Processing sequence: 5900
Processing sequence: 5900
Processing sequence: 6000
Processing sequence: 6000
Processing sequence: 6100
Processing sequence: 6100
Processing sequence: 6200
Processing sequence: 6200
Processing sequence: 6300
Processing sequence: 6300
Processing sequence: 6400
Processing sequence: 6400
Processing sequence: 6500
Processing sequence: 6500
Processing sequence: 6600
Processing sequence: 6600
Processing sequence: 6700
Processing sequence: 6700
Processing sequence: 6800
Processing sequence: 6800
Processing sequence: 6900
Processing sequence: 6900
Processing sequence: 7000
Processing sequence: 7000
Processing sequence: 7100
Processing sequence: 7100
Processing sequence: 7200
Processing sequence: 7200
Processing sequence: 7300
Processing sequence: 7300
Processing sequence: 7400
Processing sequence: 7400
Thread Processing sequence: 7478
Processing sequence: 7478

Output File Names:
silva.bacteria.pcr.fasta
silva.bacteria.bad.accnos
silva.bacteria.scrap.pcr.fasta


It took 70 secs to screen 14956 sequences.

mothur > summary.seqs()
Using silva.bacteria.pcr.fasta as input file for the fasta parameter.

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 2979 88 0 2 1
2.5%-tile: 1 2980 93 0 3 238
25%-tile: 1 2980 95 0 3 2379
Median: 1 2980 96 0 3 4757
75%-tile: 1 2980 96 0 4 7135
97.5%-tile: 1 2980 101 0 6 9275
Maximum: 1 2981 157 4 9 9512
Mean: 1 2980 95.8464 0.0279647 3.71131

of Seqs: 9512

Output File Names:
silva.bacteria.pcr.summary


mothur > screen.seqs(maxambig=0, maxhomop=8) Using silva.bacteria.pcr.fasta as input file for the fasta parameter.

Using 2 processors.
Processing sequence: 100
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 300
Processing sequence: 300
Processing sequence: 400
Processing sequence: 400
Processing sequence: 500
Processing sequence: 500
Processing sequence: 600
Processing sequence: 600
Processing sequence: 700
Processing sequence: 700
Processing sequence: 800
Processing sequence: 800
Processing sequence: 900
Processing sequence: 900
Processing sequence: 1000
Processing sequence: 1000
Processing sequence: 1100
Processing sequence: 1100
Processing sequence: 1200
Processing sequence: 1200
Processing sequence: 1300
Processing sequence: 1300
Processing sequence: 1400
Processing sequence: 1400
Processing sequence: 1500
Processing sequence: 1500
Processing sequence: 1600
Processing sequence: 1600
Processing sequence: 1700
Processing sequence: 1700
Processing sequence: 1800
Processing sequence: 1800
Processing sequence: 1900
Processing sequence: 1900
Processing sequence: 2000
Processing sequence: 2000
Processing sequence: 2100
Processing sequence: 2100
Processing sequence: 2200
Processing sequence: 2200
Processing sequence: 2300
Processing sequence: 2300
Processing sequence: 2400
Processing sequence: 2400
Processing sequence: 2500
Processing sequence: 2500
Processing sequence: 2600
Processing sequence: 2600
Processing sequence: 2700
Processing sequence: 2700
Processing sequence: 2800
Processing sequence: 2800
Processing sequence: 2900
Processing sequence: 2900
Processing sequence: 3000
Processing sequence: 3000
Processing sequence: 3100
Processing sequence: 3100
Processing sequence: 3200
Processing sequence: 3200
Processing sequence: 3300
Processing sequence: 3300
Processing sequence: 3400
Processing sequence: 3400
Processing sequence: 3500
Processing sequence: 3500
Processing sequence: 3600
Processing sequence: 3600
Processing sequence: 3700
Processing sequence: 3800
Processing sequence: 3700
Processing sequence: 3900
Processing sequence: 3800
Processing sequence: 4000
Processing sequence: 3900
Processing sequence: 4100
Processing sequence: 4000
Processing sequence: 4200
Processing sequence: 4100
Processing sequence: 4300
Processing sequence: 4200
Processing sequence: 4400
Processing sequence: 4300
Processing sequence: 4500
Processing sequence: 4400
Processing sequence: 4600
Processing sequence: 4500
Processing sequence: 4600
Processing sequence: 4700
Processing sequence: 4756
Processing sequence: 4700
Processing sequence: 4756

Output File Names:
silva.bacteria.pcr.good.fasta
silva.bacteria.pcr.bad.accnos


It took 3 secs to screen 9512 sequences.

mothur > summary.seqs()
Using silva.bacteria.pcr.good.fasta as input file for the fasta parameter.

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 2979 88 0 2 1
2.5%-tile: 1 2980 93 0 3 234
25%-tile: 1 2980 95 0 3 2332
Median: 1 2980 96 0 3 4663
75%-tile: 1 2980 96 0 4 6994
97.5%-tile: 1 2980 101 0 6 9091
Maximum: 1 2981 157 0 8 9324
Mean: 1 2980 95.8333 0 3.71332

of Seqs: 9324

Output File Names:
silva.bacteria.pcr.good.summary


mothur > summary.seqs(fasta=silva.bacteria.rdp.fasta)

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 903 1103 82 0 2 1
2.5%-tile: 903 1103 87 0 3 286
25%-tile: 903 1103 89 0 3 2853
Median: 903 1103 90 0 3 5705
75%-tile: 903 1103 92 0 4 8557
97.5%-tile: 903 1103 96 0 6 11123
Maximum: 903 1103 151 0 8 11408
Mean: 903 1103 90.2805 0 3.68058

of Seqs: 11408

Output File Names:
silva.bacteria.rdp.summary

This is entirely possible, however…

  1. The infernal alignment is not aligning the variable region. If you’re doing the V6 region this is even more likely. Look at your alignment. I bet you have a bunch of bases that are in lower case text. That indicates they weren’t aligned. If you look at Table 2 of http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000844 you’ll see that in the V6 region the RDP alignments generate distances that are about 9% higher than the SILVA distances.

  2. You can look at the report file and see which reference sequence that sequence is mapping to and go fix it. Frankly, the one column shift you’re seeing is surprising. Typically a single base is shifted left or right a spot, but rarely the entire sequence off by a spot. Is it possible that you may have edited silva.bacteria.fasta?

Hmmm…the original alignment file looks okay along that region. However when I do pcr.seqs with these primers, I still see the problem. Maybe a bug with pcr.seqs then? It only occurs very rarely as the rest of the alignment looks fine.

forward
ACGCGARGAACCTTACC
reverse
CGACARCCATGCASCACCT

mothur > pcr.seqs(fasta=silva.bacteria.fasta, oligos=primersr.txt, keepprimer=true)

Using 2 processors.
Processing sequence: 100
Processing sequence: 100
Processing sequence: 200
Processing sequence: 200
Processing sequence: 300
Processing sequence: 300
Processing sequence: 400
Processing sequence: 400
Processing sequence: 500
Processing sequence: 500
Processing sequence: 600
Processing sequence: 600
Processing sequence: 700
Processing sequence: 700
Processing sequence: 800
Processing sequence: 800
Processing sequence: 900
Processing sequence: 900
Processing sequence: 1000
Processing sequence: 1000
Processing sequence: 1100
Processing sequence: 1100
Processing sequence: 1200
Processing sequence: 1200
Processing sequence: 1300
Processing sequence: 1300
Processing sequence: 1400
Processing sequence: 1400
Processing sequence: 1500
Processing sequence: 1500
Processing sequence: 1600
Processing sequence: 1600
Processing sequence: 1700
Processing sequence: 1700
Processing sequence: 1800
Processing sequence: 1800
Processing sequence: 1900
Processing sequence: 1900
Processing sequence: 2000
Processing sequence: 2000
Processing sequence: 2100
Processing sequence: 2100
Processing sequence: 2200
Processing sequence: 2200
Processing sequence: 2300
Processing sequence: 2300
Processing sequence: 2400
Processing sequence: 2400
Processing sequence: 2500
Processing sequence: 2500
Processing sequence: 2600
Processing sequence: 2600
Processing sequence: 2700
Processing sequence: 2700
Processing sequence: 2800
Processing sequence: 2800
Processing sequence: 2900
Processing sequence: 2900
Processing sequence: 3000
Processing sequence: 3000
Processing sequence: 3100
Processing sequence: 3100
Processing sequence: 3200
Processing sequence: 3200
Processing sequence: 3300
Processing sequence: 3300
Processing sequence: 3400
Processing sequence: 3400
Processing sequence: 3500
Processing sequence: 3500
Processing sequence: 3600
Processing sequence: 3600
Processing sequence: 3700
Processing sequence: 3700
Processing sequence: 3800
Processing sequence: 3800
Processing sequence: 3900
Processing sequence: 3900
Processing sequence: 4000
Processing sequence: 4000
Processing sequence: 4100
Processing sequence: 4100
Processing sequence: 4200
Processing sequence: 4200
Processing sequence: 4300
Processing sequence: 4300
Processing sequence: 4400
Processing sequence: 4400
Processing sequence: 4500
Processing sequence: 4500
Processing sequence: 4600
Processing sequence: 4600
Processing sequence: 4700
Processing sequence: 4700
Processing sequence: 4800
Processing sequence: 4800
Processing sequence: 4900
Processing sequence: 4900
Processing sequence: 5000
Processing sequence: 5000
Processing sequence: 5100
Processing sequence: 5100
Processing sequence: 5200
Processing sequence: 5300
Processing sequence: 5200
Processing sequence: 5300
Processing sequence: 5400
Processing sequence: 5500
Processing sequence: 5400
Processing sequence: 5600
Processing sequence: 5500
Processing sequence: 5700
Processing sequence: 5600
Processing sequence: 5800
Processing sequence: 5700
Processing sequence: 5900
Processing sequence: 6000
Processing sequence: 5800
Processing sequence: 6100
Processing sequence: 6200
Processing sequence: 5900
Processing sequence: 6300
Processing sequence: 6000
Processing sequence: 6400
Processing sequence: 6500
Processing sequence: 6100
Processing sequence: 6600
Processing sequence: 6200
Processing sequence: 6700
Processing sequence: 6300
Processing sequence: 6800
Processing sequence: 6900
Processing sequence: 6400
Processing sequence: 7000
Processing sequence: 6500
Processing sequence: 7100
Processing sequence: 6600
Processing sequence: 7200
Processing sequence: 6700
Processing sequence: 7300
Processing sequence: 6800
Processing sequence: 7400
Processing sequence: 6900
Processing sequence: 7478
Processing sequence: 7000
Processing sequence: 7100
Processing sequence: 7200
Processing sequence: 7300
Processing sequence: 7400
Thread Processing sequence: 7478

Output File Names:
silva.bacteria.pcr.fasta
silva.bacteria.bad.accnos
silva.bacteria.scrap.pcr.fasta


It took 131 secs to screen 14956 sequences.

mothur > summary.seqs()
Using silva.bacteria.pcr.fasta as input file for the fasta parameter.

Using 2 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 31144 34123 89 0 2 1
2.5%-tile: 31144 34124 94 0 3 238
25%-tile: 31144 34124 96 0 3 2379
Median: 31144 34124 97 0 3 4757
75%-tile: 31144 34124 97 0 4 7135
97.5%-tile: 31144 34124 102 0 6 9275
Maximum: 31145 34124 158 4 9 9512
Mean: 31144 34124 96.846 0.0280698 3.71131

of Seqs: 9512

Output File Names:
silva.bacteria.pcr.summary

And thanks for the information about the infernal aligner, I did wonder about the lower case letters. I will try a start, end cutoff with pcr.seqs as I know where these primers are suppose to align and maybe that will eliminate the problem.

Thanks,

Elizabeth

Thanks for bringing this to our attention. I found the issue in pcr.seqs. For forward primer trimming with aligned sequences and keepdots=t. If the character before the first primer base was a base and not a gap the base was not trimmed. I fixed the problem and have uploaded version 1.30.2 to the wiki, http://www.mothur.org/wiki/Download_mothur.