screen.seqs and filter.seqs

Hi,

I am having some trouble when running screen.seqs and filter.seqs on my 16S seqs after alignment to the SILVA reference alignment.

Here is the summary of my alignment:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1103 10 0 2 1
2.5%-tile: 1044 6332 313 0 4 5658
25%-tile: 1044 8508 397 0 5 56573
Median: 1044 9890 424 0 5 113145
75%-tile: 1044 9994 434 0 5 169717
97.5%-tile: 1044 10351 454 0 6 220631
Maximum: 43097 43116 485 0 8 226288
Mean: 1042.56 9323.6 410.339 0 5.00103

# of unique seqs: 62445

total # of seqs: 226288

Then I ran screen.seqs as follows:

screen.seqs (fasta=2d.shhh.trim.pick.unique.align, name=2d.shhh.trim.pick.names, group=2d.shhh.groups, end=6332, minlength=300, processors=2)

Here is a summary of the output of the screen.seqs command:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 6332 300 0 3 1
2.5%-tile: 1044 6388 317 0 4 1485
25%-tile: 1044 7941 370 0 5 14849
Median: 1044 9820 412 0 5 29698
75%-tile: 1044 9914 430 0 5 44546
97.5%-tile: 1044 10303 455 0 6 57910
Maximum: 3616 13875 485 0 8 59394
Mean: 1044.53 8966.52 399.971 0 5.03881

# of Seqs: 59394

Next I ran filter.seqs with the trump option:

filter.seqs(fasta=2d.shhh.trim.pick.unique.good.align, vertical=T, trump=., processors=2)

Here is the summary of the filter.seqs output:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 402 131 0 3 1
2.5%-tile: 1 404 141 0 3 1485
25%-tile: 1 404 151 0 4 14849
Median: 1 404 153 0 4 29698
75%-tile: 1 404 157 0 5 44546
97.5%-tile: 11 404 168 0 6 57910
Maximum: 26 404 194 0 8 59394
Mean: 2.87526 404 154.161 0 4.32683

# of Seqs: 59394

As you can see, I am losing a lot of length after running the filter command! Is there any way I might improve on this so that I can keep around 300-400 bases of the alignment whilst not reducing my total no. of sequences too much? Any advice would be much appreciated and I apologise for the long post!

Many thanks

Hi,

The problem is…

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 6332 300 0 3 1
2.5%-tile: 1044 6388 317 0 4 1485
25%-tile: 1044 7941 370 0 5 14849
Median: 1044 9820 412 0 5 29698
75%-tile: 1044 9914 430 0 5 44546
97.5%-tile: 1044 10303 455 0 6 57910
Maximum: 3616 13875 485 0 8 59394
Mean: 1044.53 8966.52 399.971 0 5.03881

# of Seqs: 59394

The "Maximum" line indicates that you have a sequence that starts at position 3616, ends at or after 6332, and is longer than 300 bp. Because trump=. removes every column in which any sequence has a ".", you're essentially only going to be looking at the bases that run between positions 3616 and 6332. How about trying this instead?


screen.seqs (fasta=2d.shhh.trim.pick.unique.align, name=2d.shhh.trim.pick.names, group=2d.shhh.groups, start=1044, end=6332, processors=2)
Pat
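
To illustrate why a single late-starting sequence costs so much length, here is a toy Python sketch of what vertical=T and trump=. do to the alignment columns. It is a simplified model under stated assumptions, not mothur's actual implementation: the function name filter_columns and the toy sequences are made up for this example.

```python
def filter_columns(aligned, trump="."):
    """Toy model of filter.seqs with vertical=T and trump=. (not mothur's code)."""
    length = len(next(iter(aligned.values())))
    keep = []
    for col in range(length):
        column = [seq[col] for seq in aligned.values()]
        if trump in column:
            # trump: drop the column if ANY sequence has the trump character
            continue
        if all(base in "-." for base in column):
            # vertical: drop columns that contain only gap characters
            continue
        keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in aligned.items()}


# Hypothetical 13-column alignment; "." marks positions outside the read.
toy = {
    "seqA": "ACGT-ACGTAC..",
    "seqB": "ACGT-ACGTACGT",
    "seqC": "......CGTACGT",  # starts late, like the sequence starting at 3616
}

filtered = filter_columns(toy)
for name, seq in filtered.items():
    print(name, seq)
```

In this toy case only the 5 columns covered by all three sequences survive, even though two of the sequences span most of the alignment. Screening with start= as well as end= removes reads like seqC before filtering, so the shared region is wider.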