screen.seqs removed 47% of my sequences

amgomez · July 17, 2013, 7:25pm

Hi,
After using screen. seqs in the new 454 SOP, I ended up with only 53% of my original sequences, and after running all the pipeline many of my samples (that I had already ran through the pipline) ended up with <~1,000 sequences when I had obtained >~7,000.
After aligning this is what I got:
mothur > summary.seqs(fasta=current, name=current)

Using All_Howler.unique.align as input file for the fasta parameter.
Using All_Howler.names as input file for the name parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 11895 239 0 4 27921
25%-tile: 1044 11895 438 0 4 279203
Median: 1065 13870 479 0 5 558405
75%-tile: 1482 13870 491 0 5 837607
97.5%-tile: 5431 13870 512 0 6 1088888
Maximum: 43116 43116 535 0 7 1116808
Mean: 1757.19 13191.9 445.846 0 4.87609

of unique seqs: 888386

total # of seqs: 1116808

Output File Names:
All_Howler.unique.summary[/u][/u][/i]

[b]Then I ran screen.seqs in this way:

mothur > screen.seqs(fasta=current, name=current, group=All_Howler.groups, end=13870, optimize=start, criteria=95, processors=8)

After wards this is what I got:[/b]
mothur > summary.seqs(fasta=current, name=current)

Using All_Howler.unique.good.align as input file for the fasta parameter.
Using All_Howler.good.names as input file for the name parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 13870 245 0 3 1
2.5%-tile: 1044 13870 290 0 4 15829
25%-tile: 1044 13870 441 0 5 158288
Median: 1071 13870 482 0 5 316576
75%-tile: 1468 13870 497 0 5 474864
97.5%-tile: 5260 13871 516 0 6 617323
Maximum: 5341 16283 535 0 7 633151
Mean: 1578.76 13881.8 457.626 0 5.01822

of unique seqs: 535028

total # of seqs: 633151

Output File Names:
All_Howler.unique.good.summary

[b]As you can see I lost ~47% of all my sequences. I can't think of any other step that could have caused my low number of reads at the end.

Can somebody give me any clue as to whether the screen.seqs command seemed to be well written?

Thanks a lot.

A[/b]

dwaite · July 17, 2013, 8:49pm

Try set your end to 11895. This confused me a lot when I started using mothur, and I always have to stop and think when I’m explaining it to people, but if I remember correctly:

For your start column, summary.seqs() is reporting what % of your reads have begun by this alignment position. If you screen using (from your data) start=5431 it will include all your reads that have started at this position, as well as those that started earlier. If you pick the 2.5% position, then it will pick all your reads that have started by this position or earlier, which is a lot less.

This is reversed when you look at the end positions, although the logic is the same. If you pick the 97.5% value for your stop position, you’re screening for reads that have basically sequenced to or past this position. Since this is the 97.5 percentile (not 97.5% of your reads), only 2.5% of your reads actually have sequenced past this point, and those that are shorter are discarded. If this doesn’t make sense, imagine that you picked the maximum stop value as your screening parameter. You’d be asking mothur to keep sequences that were at least as long as the maximum sequence in your reads, so would only end up with a single read (or maybe a few that had the same length). If you pick the 2.5 percentile, then only 2.5% of your reads didn’t make it this far and are discarded. Reads that sequenced past this point are kept so you end up with far more (although you will lose the information past this point, either in this step or in filter.seqs - I forget which).

I think this is correct. If Pat has a better explanation (or I’m wrong) I’ll write it down for next time :mrgreen:

amgomez · July 17, 2013, 9:14pm

Thanks a lot dwaite, I think I got it. And yes it is confusing.

amgomez · July 18, 2013, 5:05pm

Well, I set my end to 11895, and it still optimizes start to 5341, same when I set my end to 13870., which is weird, I suspect I will end up loosing again the same % of sequences but the script hasn’t finished running. I’m a little confused.

pschloss · July 18, 2013, 8:27pm

Vinny…

So some more details are needed. What direction did you sequence from? If you put in an E. coli sequence that was amplified with your primers, what would the start and stop positions be? Usually there is a clear starting and ending point - you have neither. Try this and see what you get…

screen.seqs(fasta=current, name=current, group=All_Howler.groups, end=13870, start=1468, processors=8)

Also, I assume you know I would scold you about not denoising your sequences with shhh.flows first…

Pat

amgomez · July 18, 2013, 9:01pm

It worked
Now, setting end to 11895 and optimizing start I have ~94.5% of my sequences. The end criteria was the confusing factor fro me, but I’ve got it, thanks Pat, thanks dwaite.
A

Topic		Replies	Views
screen.seqs hugh loss of seqs? Commands in mothur	2	2337	March 25, 2013
Screen.seq fail Commands in mothur	2	30	September 13, 2024
Screen.seqs is removing most of my sequences mothur bugs	3	1506	October 17, 2020
Screen.seqs getting rid of most sequences Commands in mothur	3	544	March 1, 2021
screen.seqs after alignment gives problematic results mothur bugs	3	2089	November 30, 2015

screen.seqs removed 47% of my sequences

of unique seqs: 888386

of unique seqs: 535028

Related topics