unaligned after align.seqs

hello

I´m sure this a recurrent problem but i`m missing something that i cannot figure out.

I used just a forward primer in 454 amplification to V6 region.

After:

trim.seqs(fasta=F2.fna, oligos=barcode.oligos, qfile=F2.qual, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2,qwindowaverage=35, qwindowsize=50, minlength=250, maxlength=500)

I get:

mothur > summary.seqs(fasta=F2.fasta)

Start End NBases Ambigs Polymer
Minimum: 1 250 250 0 3
2.5%-tile: 1 278 278 0 4
25%-tile: 1 361 361 0 5
Median: 1 392 392 0 5
75%-tile: 1 423 423 0 5
97.5%-tile: 1 491 491 0 6
Maximum: 1 500 500 0 8

of Seqs: 14628


After command [b]uniq.seqs[/b]:

mothur > summary.seqs(fasta=F2.unique.fasta)

Start End NBases Ambigs Polymer
Minimum: 1 250 250 0 3
2.5%-tile: 1 278 278 0 4
25%-tile: 1 360 360 0 5
Median: 1 392 392 0 5
75%-tile: 1 425 425 0 5
97.5%-tile: 1 491 491 0 6
Maximum: 1 500 500 0 8

of Seqs: 13593

After command align.seqs(candidate=F2.unique.fasta, template=silva.bacteria.fasta, flip=t)
(none sequence was flipped)

mothur > summary.seqs(fasta=F2.unique.align)

Start End NBases Ambigs Polymer
Minimum: 1044 5271 221 0 3
2.5%-tile: 1044 6104 277 0 4
25%-tile: 1044 8186 360 0 5
Median: 1044 8507 392 0 5
75%-tile: 1044 9964 425 0 5
97.5%-tile:1044 13862 491 0 6
Maximum: 1467 15641 500 0 8

of Seqs: 13593

This summary shows there is a problem in the alignment. It seems there is a perfect alignment in the left column (i can cut all the sequences that do not start on 1044) but, on the other hand, there are serious problems on the right side. If i reject all the sequences that do not represent the median values i will lost a serious amount of data.

How can i improve this?

Thanks for your help.

As your sequences still have various length(ranging from 250 to 500) this is perfectly fine. Off topic. How Manu sequences did have before trimming?

I had about 16K sequences. I lost almost 2000 sequences in the trimming process. Do you think i could have defined a minimum lenght too high? The average lenght is 390.

i followed the common procedure and after filter.seqs(trump=., vertical=T) i had:

Start End NBases Ambigs Polymer
Minimum: 1 529 117 0 3
2.5%-tile: 50 544 123 0 3
25%-tile: 50 544 142 0 4
Median: 50 544 146 0 5
75%-tile: 50 544 155 0 5
97.5%-tile: 53 544 165 0 6
Maximum: 108 544 210 0 8

of Seqs: 13593

I think this is right, isn´t it?

Only 2000 sequences. It is very few sequences to lose after such a stringent trimming. Are these sequences you work with in some way preprocessed? Another thing I wondering about is that after the filter.seqs command your median length (146bp) is almost 100 bp. below your min. length in the trimming process. also there seems to be problems as some of your sequences do not start at position 1 after filtering.

No, i dont think so. The raw files were given to me by the sequencing facility without any pretreatment.


And you are right, after align.seqs command, the sequences should start at position 1.

However, i was able to “correct” this alignment problem, following a simple procedure described in sipos et al. “Robust computational analysis of rRNA hypervariable tag datasets”. This article state that a consistent alignment could be achieved merging an alignment produced by NAST+Silva (mothur) with an alignment made by infernal algorith (RDP). They developed a tool that can merge the best alignments from each tool by combining the hypervariable regions aligned using the NAST algorithm with the regions of strong secondary structure aligned by infernal.

I did this and after filter.seqs i have:

Start End NBases Ambigs Polymer
Minimum: 1 505 213 0 3
2.5%-tile: 1 505 219 0 4
25%-tile: 1 505 247 0 4
Median: 1 505 255 0 5
75%-tile: 1 505 260 0 5
97.5%-tile: 1 505 274 0 6
Maximum: 1 505 281 0 8

of Seqs: 12705

I think this is right now. What do you think about this strategy?

thanks