TL;DR - Why don’t we (in the SOP) discard any sequences/alignments which are (extremely) shorter or longer than the majority?
From the help page and MiSeq SOP:
…the following command will remove those sequences that don’t start by position X and do end before position Y…
…to get sequences that start at or before position X and end at or after position Y…
Why are sequences that start before our start= parameter kept? Is this generally suggested, even in extreme cases like the one I outlined below?
Because I have the following case, after aligning to the V4 reference:
mothur > summary.seqs(fasta=run066_09122016.contigs.good.unique.align, count=run066_09122016.contigs.good.count_table)
Using 48 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1231 3 0 1 1
2.5%-tile: 1968 11550 251 0 4 336016
25%-tile: 1968 11550 253 0 4 3360158
Median: 1968 11550 253 0 5 6720316
75%-tile: 1968 11550 253 0 6 10080474
97.5%-tile: 1968 11550 254 0 12 13104616
Maximum: 13422 13425 274 0 26 13440631
Mean: 1969.45 11549.8 252.805 0 5.56044
# of unique seqs: 1044318
total # of seqs: 13440631
mothur > screen.seqs(fasta=run066_09122016.contigs.good.unique.align, count=run066_09122016.contigs.good.count_table, summary=run066_09122016.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)
all the sequences which start at 1 were kept:
mothur > summary.seqs(fasta=run066_09122016.contigs.good.unique.good.align, count=run066_09122016.contigs.good.good.count_table)
Using 48 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 11550 229 0 3 1
2.5%-tile: 1968 11550 251 0 4 316516
25%-tile: 1968 11550 253 0 4 3165154
Median: 1968 11550 253 0 5 6330308
75%-tile: 1968 11550 253 0 6 9495461
97.5%-tile: 1968 11550 253 0 8 12344099
Maximum: 1968 13425 274 0 8 12660614
Mean: 1967.98 11550.1 252.794 0 5.19034
# of unique seqs: 1007196
total # of seqs: 12660614
I have 62 sequences which have an alignment from 1 to 11550 with 273 bases… that just can’t be right?
Would the easiest solution be to use screen.seqs(ostart)? However I don’t have a contigs.report file since I did not run make.contigs. Is there any other way I can remove them (apart from manually removing them by parsing contigs.good.unique.good.align and contigs.good.good.count_table) or should I just ignore them and carry on?
I’m still curious about the reasoning why the command keeps certain start positions, but in the meantime I decided to just filter out all sequences/alignments that don’t exactly start at 1968 and end at 11550 because there were a few apart from the few that started at 1.
mothur > summary.seqs(fasta=run066_09122016.contigs.good.unique.good.align_only1968, count=run066_09122016.contigs.good.good.count_table_only1968)
Using 48 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1968 11550 229 0 3 1
2.5%-tile: 1968 11550 251 0 4 316451
25%-tile: 1968 11550 253 0 4 3164508
Median: 1968 11550 253 0 5 6329015
75%-tile: 1968 11550 253 0 6 9493522
97.5%-tile: 1968 11550 253 0 8 12341579
Maximum: 1968 11550 274 0 8 12658029
Mean: 1968 11550 252.793 0 5.19007
# of unique seqs: 1005839
total # of seqs: 12658029
This eliminated some 2k unique seqs from the dataset, not that many in the end. What speaks against doing this?
PS. Now I’m thinking about removing any sequences which are longer than the expected ~250bp V4 amplicon. In the SOP, shorter and longer sequences are still kept, why?
Thanks!