Loss of bases with filter.seqs

Hello,

We’re running mothur on 16S gene sequences produced on the Illumina platform. I’ve attempted to optimize my screen.seqs step, and I no longer run into the error where filter.seqs removes every column. However, the sequences that come out of the filter.seqs step are about half the length of those after screen.seqs. Below are my input commands and summaries after screen.seqs and filter.seqs:

screen.seqs(fasta=filename.trim.unique.align, name=filename.trim.names, group=filename.groups, minlength=75, start=40877, end=41432)

summary.seqs
             Start   End     NBases    Ambigs   Polymer   NumSeqs
Minimum:     40158   41432   75        0        2         1
2.5%-tile:   40339   41444   76        0        3         5644
25%-tile:    40727   41488   76        0        3         56431
Median:      40781   41488   79        0        3         112861
75%-tile:    40877   41547   79        0        3         169291
97.5%-tile:  40877   41562   79        0        5         220078
Maximum:     40877   42531   148       0        6         225721
Mean:        40825   41513   77.9236   0        3.25359

# of unique seqs: 4979
total # of seqs: 225721

filter.seqs(fasta=filename.trim.unique.good.align, vertical=T, trump=.)

summary.seqs
             Start     End       NBases    Ambigs   Polymer   NumSeqs
Minimum:     1         57        22        0        2         1
2.5%-tile:   1         65        25        0        2         5644
25%-tile:    1         65        25        0        2         56431
Median:      1         65        28        0        2         112861
75%-tile:    1         65        28        0        3         169291
97.5%-tile:  1         65        33        0        3         220078
Maximum:     5         65        39        0        5         225721
Mean:        1.00314   64.9997   27.1742   0        2.43922

# of unique seqs: 4979
total # of seqs: 225721

How could the filter step chop the sequences in half? Does the number of bases in the summary after screen.seqs include blank columns (.) as well as actual bases? If so, is there a way I can screen out sequences that small in my screen.seqs step? If not, how do I keep my ~75-80 bp sequences?

Thank you very much!

~Alexa

Hi Alexa,

Basically what’s happening is that your output of screen.seqs looks something like…

....ATGCATGCATGC...
ATGCATGCATGC.......
..GCATGCATGCAT.....

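With trump=., filter.seqs removes every column in which any sequence has a “.”, so only the region that all of the sequences overlap survives. After running filter.seqs(vertical=T, trump=.) you get…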

ATGCATGC
ATGCATGC
ATGCATGC
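
To make the toy example above concrete, here is a minimal Python sketch of what trump=. does to those three sequences. It is only an illustration of the column-dropping idea, not mothur’s actual code: the function name trump_filter is made up for this post, and the sketch ignores vertical=T (which additionally drops columns that contain only gap characters).

```python
def trump_filter(alignment, trump="."):
    """Keep only the columns in which no sequence has the trump character.

    Hypothetical helper for illustration; this is not part of mothur.
    """
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    # A column survives only if every sequence has something other than
    # the trump character at that position.
    keep = [
        col for col in range(width)
        if all(seq[col] != trump for seq in alignment)
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# The three toy sequences from the example above.
aligned = [
    "....ATGCATGCATGC...",
    "ATGCATGCATGC.......",
    "..GCATGCATGCAT.....",
]

filtered = trump_filter(aligned)
for seq in filtered:
    print(seq)
```

Each 12-base read shrinks to the 8 columns that all three reads share, which is the same effect you are seeing on your real data.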

I would suggest playing with the start and end values in screen.seqs. You can open the summary file that you generated with summary.seqs and, using something like R or Excel, see which combination of start and end values gives you the most sequences.
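
If you would rather script the search than do it in R or Excel, something along these lines could work. Treat it as a rough sketch under a few assumptions: that the summary file is tab-delimited with a header containing start, end, and numSeqs columns (check your own file), and that screen.seqs keeps a sequence when it starts at or before start and ends at or after end. The function names and the toy rows are made up for illustration.

```python
import csv
import io


def load_summary(handle):
    """Read (start, end, numSeqs) from a tab-delimited summary file.

    Assumes header columns named 'start', 'end', and 'numSeqs'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(int(r["start"]), int(r["end"]), int(r["numSeqs"])) for r in reader]


def seqs_kept(rows, start, end):
    """Count sequences that start at or before `start` and end at or after `end`."""
    return sum(n for s, e, n in rows if s <= start and e >= end)


def rank_windows(rows, starts, ends):
    """Try every start/end pair and sort by number of sequences kept."""
    results = [
        (seqs_kept(rows, s, e), s, e)
        for s in starts
        for e in ends
        if e > s
    ]
    return sorted(results, reverse=True)


# Made-up rows for illustration only; with a real file you would use
# something like: rows = load_summary(open("your.summary", newline=""))
toy_summary = "\n".join([
    "seqname\tstart\tend\tnbases\tambigs\tpolymer\tnumSeqs",
    "seqA\t40158\t41432\t148\t0\t3\t1",
    "seqB\t40781\t41488\t79\t0\t3\t10",
    "seqC\t40877\t41547\t79\t0\t3\t5",
    "seqD\t40877\t41562\t76\t0\t5\t4",
])
rows = load_summary(io.StringIO(toy_summary))

ranked = rank_windows(rows, starts=[40781, 40877], ends=[41432, 41488])
for kept, start, end in ranked:
    print(f"start={start} end={end} keeps {kept} sequences")
```

Keep in mind the trade-off: a wider start/end window leaves more overlapping columns after filter.seqs but throws away more sequences, so the pair that keeps the most sequences is not automatically the best choice.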

Pat