screen.seqs(start) clarification

TL;DR - Why don’t we (in the SOP) discard any sequences/alignments which are (extremely) shorter or longer than the majority?

From the help page and MiSeq SOP:

…the following command will remove those sequences that don’t start by position X and do end before position Y…

…to get sequences that start at or before position X and end at or after position Y…

Why are sequences that start before our start= parameter kept? Is this generally suggested, even in extreme cases like the one I outlined below?

Because I have the following case, after aligning to the V4 reference:

mothur > summary.seqs(fasta=run066_09122016.contigs.good.unique.align, count=run066_09122016.contigs.good.count_table)

Using 48 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       1231    3       0       1       1
2.5%-tile:      1968    11550   251     0       4       336016
25%-tile:       1968    11550   253     0       4       3360158
Median:         1968    11550   253     0       5       6720316
75%-tile:       1968    11550   253     0       6       10080474
97.5%-tile:     1968    11550   254     0       12      13104616
Maximum:        13422   13425   274     0       26      13440631
Mean:   1969.45 11549.8 252.805 0       5.56044
# of unique seqs:       1044318
total # of seqs:        13440631

mothur > screen.seqs(fasta=run066_09122016.contigs.good.unique.align, count=run066_09122016.contigs.good.count_table, summary=run066_09122016.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)

all the sequences which start at 1 were kept:

mothur > summary.seqs(fasta=run066_09122016.contigs.good.unique.good.align, count=run066_09122016.contigs.good.good.count_table)

Using 48 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       11550   229     0       3       1
2.5%-tile:      1968    11550   251     0       4       316516
25%-tile:       1968    11550   253     0       4       3165154
Median:         1968    11550   253     0       5       6330308
75%-tile:       1968    11550   253     0       6       9495461
97.5%-tile:     1968    11550   253     0       8       12344099
Maximum:        1968    13425   274     0       8       12660614
Mean:   1967.98 11550.1 252.794 0       5.19034
# of unique seqs:       1007196
total # of seqs:        12660614

I have 62 sequences which have an alignment from 1 to 11550 with 273 bases… that just can’t be right?
Would the easiest solution be to use screen.seqs(ostart)? However I don’t have a contigs.report file since I did not run make.contigs. Is there any other way I can remove them (apart from manually removing them by parsing contigs.good.unique.good.align and contigs.good.good.count_table) or should I just ignore them and carry on?

I’m still curious about the reasoning why the command keeps certain start positions, but in the meantime I decided to just filter out all sequences/alignments that don’t exactly start at 1968 and end at 11550 because there were a few apart from the few that started at 1.

mothur > summary.seqs(fasta=run066_09122016.contigs.good.unique.good.align_only1968, count=run066_09122016.contigs.good.good.count_table_only1968)

Using 48 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1968    11550   229     0       3       1
2.5%-tile:      1968    11550   251     0       4       316451
25%-tile:       1968    11550   253     0       4       3164508
Median:         1968    11550   253     0       5       6329015
75%-tile:       1968    11550   253     0       6       9493522
97.5%-tile:     1968    11550   253     0       8       12341579
Maximum:        1968    11550   274     0       8       12658029
Mean:   1968    11550   252.793 0       5.19007
# of unique seqs:       1005839
total # of seqs:        12658029

This eliminated some 2k unique seqs from the dataset, not that many in the end. What speaks against doing this?

PS. Now I’m thinking about removing any sequences which are longer than the expected ~250bp V4 amplicon. In the SOP, shorter and longer sequences are still kept, why?


Thanks!

I’m still interested in the best practices for keeping/removing certain alignments :,(

TL;DR - Why don’t we (in the SOP) discard any sequences/alignments which are (extremely) shorter or longer than the majority?

These are removed, but it’s done implicitly in screen.seqs when you remove by start/stop positions. Short reads will not span both the start and stop position so get removed that way.

Why are sequences that start before our start= parameter kept? Is this generally suggested, even in extreme cases like the one I outlined below?

That’s because during the filter.seqs step, those overhangs are trimmed from the alignment anyway. It’s the same case with reads that go past the stop position - that data is trimmed from the alignment frame. The idea is that screening by start/stop removes sequences that don’t span this region completely (i.e. short sequences), and those that are retained then get trimmed down to a consensus alignment space.

I have 62 sequences which have an alignment from 1 to 11550 with 273 bases… that just can’t be right?

It’s not necessarily surprising. The SILVA 16S alignment model is about 50,000 positions for ~1,600 nucleotides. The space is mostly just empty gaps so you can allow insertions between any two positions in the reference. As long as the filtered alignment looks like a reasonable number of positions then it’s not a problem. If your filtered alignment is still large* then you might have poorly aligned regions.

  • It’s a bit arbitrary, but if the filtered alignment is 3 or 4 times longer than your reads, I usually give it a quick check in an alignment editor.