Short sequences following screen.seqs and filter.seqs

Hello all,

I have a couple of 16S datasets obtained from gut biopsies that I am trying to analyze. However, in both datasets I am getting really poor classification past the phylum level. I’ve noticed that following the screening/filtering steps after the alignment, I end up with really short sequences (from 186bp before the alignment to 67bp after filtering). I have tried playing around with the screen.seqs parameters, and have also tried aligning to the GreenGenes database, but I keep getting the same results. I have the same problem with both datasets.

For example, here is a summary of what I have following the alignment step in one of my datasets:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1053 5 0 2 1
2.5%-tile: 1044 1213 55 0 3 1728
25%-tile: 1044 2532 123 0 4 17272
Median: 1044 4978 197 0 4 34544
75%-tile: 1044 5413 246 0 5 51815
97.5%-tile: 1044 6409 322 0 6 67359
Maximum: 43111 43116 360 0 7 69086
Mean: 1077.25 4201.27 186.622 0 4.268

of Seqs: 69086


After running: screen.seqs(fasta=calgary.trim.unique.align,name=calgary.trim.names,group=calgary.groups,start=1044,optimize=end,criteria=85) filter.seqs(fasta=calgary.trim.unique.good.align,vertical=T,trump=.,processors=12)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 237 45 0 2 1
2.5%-tile: 1 237 52 0 2 1565
25%-tile: 1 237 65 0 3 15648
Median: 1 237 68 0 4 31296
75%-tile: 1 237 71 0 4 46944
97.5%-tile: 1 237 75 0 4 61027
Maximum: 1 237 108 0 6 62591
Mean: 1 237 67.2747 0 3.46726

of Seqs: 62591

Does anyone have any ideas what could be going wrong here?

Thanks in advance!

Kyle

So you have the start position right. The problem is that your sequences are generally pretty short - 25% of your reads are less than 123 bases. You might need to change criteria to equal 50. You probably also want to ask your sequencing facility why the reads were so bad.

As you probably have guessed, short reads will not classify very well.

Pat

Thanks Pat, I appreciate your input.

I was hoping it was something that I was doing wrong, rather than a data quality issue.

Kyle

from the filter.seqs page

NOTE: having one or two sequences included that don’t align with the bulk of your sequences may lead to all columns being removed by the trump option!

I often have a few bad seqs slip through to this step (like your maximal sequence) if you use trump=. in that case you’ll throw away a lot of good bases because they don’t overlap with the few bad seqs