screen.seqs removed 47% of my sequences

After using screen. seqs in the new 454 SOP, I ended up with only 53% of my original sequences, and after running all the pipeline many of my samples (that I had already ran through the pipline) ended up with <~1,000 sequences when I had obtained >~7,000.
After aligning this is what I got:

mothur > summary.seqs(fasta=current, name=current)

Using All_Howler.unique.align as input file for the fasta parameter.
Using All_Howler.names as input file for the name parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 11895 239 0 4 27921
25%-tile: 1044 11895 438 0 4 279203
Median: 1065 13870 479 0 5 558405
75%-tile: 1482 13870 491 0 5 837607
97.5%-tile: 5431 13870 512 0 6 1088888
Maximum: 43116 43116 535 0 7 1116808
Mean: 1757.19 13191.9 445.846 0 4.87609

of unique seqs: 888386

total # of seqs: 1116808

Output File Names:

[b]Then I ran screen.seqs in this way:

mothur > screen.seqs(fasta=current, name=current, group=All_Howler.groups, end=13870, optimize=start, criteria=95, processors=8)

After wards this is what I got:[/b]
mothur > summary.seqs(fasta=current, name=current)

Using All_Howler.unique.good.align as input file for the fasta parameter.
Using All_Howler.good.names as input file for the name parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 13870 245 0 3 1
2.5%-tile: 1044 13870 290 0 4 15829
25%-tile: 1044 13870 441 0 5 158288
Median: 1071 13870 482 0 5 316576
75%-tile: 1468 13870 497 0 5 474864
97.5%-tile: 5260 13871 516 0 6 617323
Maximum: 5341 16283 535 0 7 633151
Mean: 1578.76 13881.8 457.626 0 5.01822

of unique seqs: 535028

total # of seqs: 633151

Output File Names:

[b]As you can see I lost ~47% of all my sequences. I can't think of any other step that could have caused my low number of reads at the end.

Can somebody give me any clue as to whether the screen.seqs command seemed to be well written?

Thanks a lot.


Try set your end to 11895. This confused me a lot when I started using mothur, and I always have to stop and think when I’m explaining it to people, but if I remember correctly:

For your start column, summary.seqs() is reporting what % of your reads have begun by this alignment position. If you screen using (from your data) start=5431 it will include all your reads that have started at this position, as well as those that started earlier. If you pick the 2.5% position, then it will pick all your reads that have started by this position or earlier, which is a lot less.

This is reversed when you look at the end positions, although the logic is the same. If you pick the 97.5% value for your stop position, you’re screening for reads that have basically sequenced to or past this position. Since this is the 97.5 percentile (not 97.5% of your reads), only 2.5% of your reads actually have sequenced past this point, and those that are shorter are discarded. If this doesn’t make sense, imagine that you picked the maximum stop value as your screening parameter. You’d be asking mothur to keep sequences that were at least as long as the maximum sequence in your reads, so would only end up with a single read (or maybe a few that had the same length). If you pick the 2.5 percentile, then only 2.5% of your reads didn’t make it this far and are discarded. Reads that sequenced past this point are kept so you end up with far more (although you will lose the information past this point, either in this step or in filter.seqs - I forget which).

I think this is correct. If Pat has a better explanation (or I’m wrong) I’ll write it down for next time :mrgreen:

Thanks a lot dwaite, I think I got it. And yes it is confusing.

Well, I set my end to 11895, and it still optimizes start to 5341, same when I set my end to 13870., which is weird, I suspect I will end up loosing again the same % of sequences but the script hasn’t finished running. I’m a little confused.


So some more details are needed. What direction did you sequence from? If you put in an E. coli sequence that was amplified with your primers, what would the start and stop positions be? Usually there is a clear starting and ending point - you have neither. Try this and see what you get…

screen.seqs(fasta=current, name=current, group=All_Howler.groups, end=13870, start=1468, processors=8)

Also, I assume you know I would scold you about not denoising your sequences with shhh.flows first…


It worked
Now, setting end to 11895 and optimizing start I have ~94.5% of my sequences. The end criteria was the confusing factor fro me, but I’ve got it, thanks Pat, thanks dwaite.