filter.seqs bug?

Hi there,
I am not sure whether it is a bug or me… :oops:
mothur > summary.seqs(fasta=16S_juvs_all.trim.rename.unique.align)


...so this is what I had after the alignment (SILVA):

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1046 2 0 1 1
2.5%-tile: 1044 5711 246 0 4 1638
25%-tile: 1044 8411 370 0 5 16377
Median: 1044 9964 406 0 5 32753
75%-tile: 1044 11888 446 0 5 49129
97.5%-tile: 1044 13862 493 0 6 63868
Maximum: 43112 43116 531 0 8 65505
Mean: 1178.13 10195.5 397.359 0 4.89766

# of Seqs: 65505



And this is what I did to screen the sequences:

mothur > screen.seqs(fasta=16S_juvs_all.trim.rename.unique.align, name=16S_juvs_all.trim.rename.names, minlength=200, start=1044)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 4710 200 0 3 1
2.5%-tile: 1044 6091 254 0 4 1605
25%-tile: 1044 8411 372 0 5 16043
Median: 1044 9964 406 0 5 32085
75%-tile: 1044 11888 446 0 5 48127
97.5%-tile: 1044 13862 493 0 6 62565
Maximum: 1044 15634 531 0 8 64169
Mean: 1044 10118.8 399.824 0 4.90717

# of Seqs: 64169

…BUT then when I ran the filter… I got shorter sequences!!

mothur > filter.seqs(fasta=16S_juvs_all.trim.rename.unique.good.align, vertical=T, trump=.)


Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 650 130 0 3 1
2.5%-tile: 1 701 148 0 3 1605
25%-tile: 1 701 152 0 4 16043
Median: 1 701 161 0 5 32085
75%-tile: 1 701 168 0 5 48127
97.5%-tile: 1 701 187 0 6 62565
Maximum: 1 701 213 0 8 64169
Mean: 1 700.996 161.918 0 4.62301

# of Seqs: 64169

So then I ran it without trump and vertical and got my sequence lengths back… but obviously I now have a longer alignment… How can I fix this??

mothur > filter.seqs(fasta=16S_juvs_all.trim.rename.unique.good.align)


Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 701 200 0 3 1
2.5%-tile: 1 993 254 0 4 1605
25%-tile: 1 1241 372 0 5 16043
Median: 1 1370 406 0 5 32085
75%-tile: 1 1516 446 0 5 48127
97.5%-tile: 1 1565 493 0 6 62565
Maximum: 1 1617 531 0 8 64169
Mean: 1 1360.34 399.824 0 4.90717

# of Seqs: 64169

Output File Name:
16S_juvs_all.trim.rename.unique.good.filter.summary


Thanks! Kim

Hi Kim,

The issue is that the V1 region is problematic because some lineages have significant insertions/introns within the region (e.g. TM7 comes to mind). So if you calibrate everything by length, one 200 bp fragment may only go halfway into the alignment while another will extend much further. Instead of setting minlength=200, can you try end=5711? That way you know all of the sequences span the same region of the alignment.
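
To make the difference between the two criteria concrete, here is a minimal Python sketch (not mothur code). It assumes a read passes if it starts at or before `start`, ends at or after `end`, and has at least `minlength` bases. The four reads and the `screen` helper are made up for illustration and are not taken from the data in this thread.

```python
# Toy reads: (name, start column, end column, number of bases).
# Hypothetical values, chosen only to illustrate the point.
reads = [
    ("read_a", 1044, 5900, 250),
    ("read_b", 1044, 4710, 250),  # same length as read_a, but stops earlier in the alignment
    ("read_c", 1044, 9964, 406),
    ("read_d", 1100, 8411, 370),  # starts after column 1044
]


def screen(reads, start=None, end=None, minlength=None):
    """Return names of reads that satisfy every criterion that was given."""
    kept = []
    for name, read_start, read_end, nbases in reads:
        if start is not None and read_start > start:
            continue  # read begins too late
        if end is not None and read_end < end:
            continue  # read stops too early
        if minlength is not None and nbases < minlength:
            continue  # read is too short
        kept.append(name)
    return kept


by_length = screen(reads, start=1044, minlength=200)
by_position = screen(reads, start=1044, end=5711)
print("length-based:  ", by_length)
print("position-based:", by_position)
```

With the length criterion, read_b is kept even though it stops well before the other reads; with the position criterion it is dropped, so everything that remains overlaps the same columns.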

Pat

Great! thanks :smiley:

…But I still have another question…
Why end at 5711 and not at 13862 or 8411?
For example, in another set of sequences I have this:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1109 12 0 2 1442
25%-tile: 1044 8365 285 0 4 14417
Median: 1044 9916 399 0 5 28833
75%-tile: 1044 13130 470 0 5 43249
97.5%-tile: 43061 43116 501 0 6 56223
Maximum: 43116 43117 535 0 8 57664
Mean: 7144.03 14110.3 336.496 0 4.58171

# of Seqs: 57664

So in this case I did:

mothur > screen.seqs(fasta=16S_ads_all.trim.rename.unique.align, name=16S_ads_all.trim.rename.names, group=16S_ads_all.trim.rename.groups, start=1044, end=8365)

And after filtering I got:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1039 325 0 3 1
2.5%-tile: 1 1042 351 0 4 849
25%-tile: 1 1042 381 0 5 8486
Median: 1 1042 383 0 5 16971
75%-tile: 1 1042 395 0 5 25456
97.5%-tile: 1 1042 398 0 6 33093
Maximum: 1 1042 463 0 8 33941
Mean: 1 1042 384.009 0 5.07274

# of Seqs: 33941

Output File Name:
16S_ads_all.trim.rename.unique.good.filter.summary

So it looks good… but I have trouble understanding which alignment length is best…

thanks
kim

Sure - that’s up to you. I suggested 5711 because that would allow you to use the most sequences for the length you seemed to be interested in. If you’re running these through trim.flows/shhh.flows then your reads will generally be in the 250-300 bp range.
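
As a rough way to compare candidate end values, the End percentiles in the first summary.seqs output give an estimate of how many sequences each cutoff would keep. Here is a small Python sketch, assuming a sequence passes if it ends at or after the cutoff. It ignores the start criterion and ties at the percentile values, so the counts are only approximate.

```python
# "# of Seqs" and End percentiles copied from the first summary.seqs output.
total = 65505
end_percentiles = {5711: 0.025, 8411: 0.25, 9964: 0.50, 11888: 0.75, 13862: 0.975}

# If a fraction p of sequences end at or before a column, roughly (1 - p) reach it.
# Approximation only: ignores ties and the start=1044 criterion.
approx_kept = {end: round(total * (1 - p)) for end, p in end_percentiles.items()}

for end, kept in approx_kept.items():
    print(f"end={end}: about {kept} of {total} sequences kept")
```

On these numbers, end=5711 keeps roughly 97.5% of the sequences, end=8411 about 75%, and end=13862 only about 2.5%.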