Problem with 16S rRNA Archaea alignment

Hey all,

I am currently trying to align a large dataset of 16S archaea sequences. I went through a modified Schloss SOP. After the implemented command
“align.seqs(fasta=XX.chop.fasta, reference=silva.archaea.fasta)”
I noticed that 1/4 of my sequences didnt align at the front and back of the whole alignment implying in further steps that these sequences are trash. After carefully checking these sequences I noticed that they are most likely valid sequences (just don’t align well at the front and back)
Now I would run into the need of aligning these sequences “by hand”. Do you see another possibility? Is there maybe a “better”/modified alignment of the silva one?
I see this as not just a personal problem as with this problem alot of potential important phylogenetic information might get lost in the common standardized data analysis.

I hope I didn’t miss any important information.

So it’s likely that many of the references sequences in silva.archaea.fasta are not full length. You might need to customize a database to include the regions that you are interested in.

Hi,

I am also having this problem when running align.seqs using the silva.archaea.fasta reference.

After the alignment, I obtained this:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1044 1 0 1 167
25%-tile: 1044 1048 2 0 1 1667
Median: 6451 15968 4 0 2 3334
75%-tile: 43053 43116 17 0 3 5001
97.5%-tile: 43115 43116 220 0 6 6501
Maximum: 43116 43116 240 0 7 6667
Mean: 17464.2 19032.7 40.0118 0 2.32008

of unique seqs: 1428

total # of seqs: 6667

When I screened the sequences using start=43115 and then filtered these sequences, I obtained the following:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 165
25%-tile: -1 -1 0 0 1 1645
Median: -1 -1 0 0 1 3290
75%-tile: -1 -1 0 0 1 4934
97.5%-tile: -1 -1 0 0 1 6414
Maximum: -1 -1 0 0 1 6578
Mean: 2.80431e+15 2.80431e+15 0 0 1

of unique seqs: 1390

total # of seqs: 6578

Is there any update on how to fix this problem besides customizing my own template alignment file?

Thanks!

I wonder whether your sequences are backwards. Could you try running align.seqs with flip=T and see what happens to the output of summary.seqs?

Hi Patrick,

Thanks for your reply! Unfortunately, when I used align.seqs(fasta=349F450F.shhh.trim.unique.fasta, reference=silva.archaea.fasta,flip=t processors=2) I obtained the same outputs

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1044 1 0 1 183
25%-tile: 1044 1071 4 0 2 1821
Median: 43053 43116 11 0 2 3641
75%-tile: 43112 43116 17 0 3 5461
97.5%-tile: 43115 43116 294 0 6 7099
Maximum: 43116 43116 325 0 6 7281
Mean: 25765.1 27937.4 47.7781 0 2.4178

of unique seqs: 1712

total # of seqs: 7281

After filtering I saw this again

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 174
25%-tile: -1 -1 0 0 1 1731
Median: -1 -1 0 0 1 3462
75%-tile: -1 -1 0 0 1 5193
97.5%-tile: -1 -1 0 0 1 6750
Maximum: -1 -1 0 0 1 6923
Mean: 2.66456e+15 2.66456e+15 0 0 1

of unique seqs: 1540

total # of seqs: 6923

Do you have any other suggestion I could try?
Thanks!
Claudia

What region of the 16S are you sequencing? Can you post one of your sequences that isn’t aligning?

Hi Patrick,

I am using the following primers:

Arch349F GYGCASCAGKCGMGAAW
Arch806R GGACTACVSGGGTATCTAAT

Claudia

Can you post one of your sequences that isn’t aligning?

Hi Pat,

Here they are:

H4R0T2203HC35O
CATTCACAATGGGCGAAAGCCTGATGGTGTGACGCCGCGTGGAGGATGAAGGTCTTCGGATTGTAAACTCCTGTCAAGTGAGAGTAAAGCCGGGCGTTAACTGCGTGCCGGTTTGAGAGTATCACTAGAGGAAGAGACGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGGGTGCGTAGGCTGCATGGTAAGTCAGAGGTGAAATGCCGGAGCT

H4R0T2203G3KXL
AGTCTACAATGGACGAAAGTCTGATAGTGCGACGCCGCGTGAACGAAGAATCCTTTCGGGGTGTAAAGTTCTTTTCTATGTGAGCAGTTCCTTTTATATGAAGAGTATAGAAGGGGGGATTTAAGCATAGGAATAAGCAACGGCTAACTCCGTGCCAGCAGCCGCGGTAACACGGGGGTTGCAAGCGTTGTCCGGAATTACTAGGCGTAAAGTGCAGGTAGGCGGATAGATAAGTTTAAGGTTAAAGGCTACGGCTTACCCGTAGTAAGGCGTTAGATACTGTTTATCTGGAAT

H4R0T2203HFY6M
AGTCTACAATGGACGAAAGTCTGATAGTGCGACGCCGCGTGAACGAAGAATCCTTTCGGGGTGTAAAGTTCTTTTCTATGTGAGCAGTTCCTTTTATATGAAGAGTATAGAAGGGGGGATTTAAGCATAGGAATAAGCAACGGCTAACTCCGTGCCAGCAGCCGCGGTAACACGGGGGTTGCAAGCGTTGTCCGGAATTACTAGGCGTAAAGTGCAGGTAGGCGGATAGATAAGTTTAAGGTTAAAGGCTACGGCTTACCCGTAGTAAGGCGTTAGATACTGTTTATCTGGAA

Thanks!

These seem to be bacterial sequences, can you try aligning them to silva.bacteria.fasta?

Hi Pat,

Thanks for your comment!

The sequences worked just fine when I aligned them against the silva.bacteria.fasta file. Then, when I classified these new aligned sequences using the green genes files gg_13_8_99.fasta and gg_13_8_99.gg.tax they were all classified as bacteria.

Now my question is how can that be if the primers used were specific for Archaea (Arch 349F/Arch 806R). In my search for an answer I even found a paper where pyrosequencing was conducted at the same facility I sent my samples using the same primers. In this case, the study was able to successfully identify Archaea members.

Could it be that I am missing an important step in the pipeline?

Thanks in advance for your help!

Sometimes primers aren’t as specific as we hope for and sometimes weird things happen with low-abundance populations.

Pat