screen.seqs

I was following the Costello analysis.
I made the alignment, align.seqs(fasta=july.unique.fasta, reference=silva.bacteria.fasta, ksize=8, flip=T)
Here are the summary.seqs(fasta=july.unique.align) results:

Start End NBases Ambigs Polymer
Minimum: 1044 1044 1 0 1
2.5%-tile: 6214 13125 104 0 3
25%-tile: 6333 13125 143 0 5
Median: 6333 13125 156 0 5
75%-tile: 6414 13125 159 0 5
97.5%-tile: 7928 13125 181 0 7
Maximum: 43105 43116 461 0 31

of Seqs: 61990

I wanted to exclude anything that doesn’t start by 6214 and end by 13125. I used the following to accomplish this
screen.seqs(fasta=july.unique.align, name=july.names, group=july.groups, start=6214, end=13125, maxhomop=8)

Here are the summary.seqs(fasta=july.unique.good.align) results:

   Start End         NBases Ambigs Polymer

Minimum: 1115 13125 168 0 4
2.5%-tile: 5441 13125 172 0 4
25%-tile: 6107 13125 183 0 5
Median: 6107 13125 185 0 5
75%-tile: 6200 13125 193 0 5
97.5%-tile: 6214 13125 221 0 6
Maximum: 6214 13879 461 0 7

of Seqs: 941

The majority of my sequences wound up in the “bad” file and only 941 survived this screen, which is not what I expected based on the summary of the input file for screen.seqs and where I chose to set start/end. Also, of the ones that survived the start and end positions aren’t consistent with my input parameters of 6214 and 13125 respectively.

I repeated the final summary.seqs and added in the names file in case that was the problem:

mothur > summary.seqs(fasta=july.unique.good.align, name=july.good.names)

Start End NBases Ambigs Polymer
Minimum: 1115 13125 168 0 4
2.5%-tile: 6102 13125 173 0 4
25%-tile: 6107 13125 184 0 5
Median: 6107 13125 185 0 5
75%-tile: 6200 13125 197 0 5
97.5%-tile: 6212 13125 209 0 6
Maximum: 6214 13879 461 0 7

of unique seqs: 941

total # of seqs: 1685

My output is slighty different, but the error in the filter seems to persist. I also tried filtering by length instead of start/end

screen.seqs(fasta=july.unique.align, name=july.names, group=july.groups, minlength=104, maxlength=181, maxhomop=8)

Here are those results:

Start End NBases Ambigs Polymer
Minimum: 1044 2521 104 0 2
2.5%-tile: 6333 13125 122 0 4
25%-tile: 6333 13125 156 0 5
Median: 6333 13125 156 0 5
75%-tile: 6333 13125 159 0 5
97.5%-tile: 7694 13125 177 0 6
Maximum: 40797 41783 181 0 8

of unique seqs: 59407

total # of seqs: 488827

In this case the screen settings actually worked. I followed this up with

filter.seqs(fasta=july.unique.good.align, vertical=T, trump=.)

and got this result:

Start End NBases Ambigs Polymer
Minimum: 0 0 0 0 1
2.5%-tile: -1 -1 0 0 1
25%-tile: -1 -1 0 0 1
Median: -1 -1 0 0 1
75%-tile: -1 -1 0 0 1
97.5%-tile: -1 -1 0 0 1
Maximum: -1 -1 0 0 1

of Seqs: 59407

I’m guessing this is bad…really bad… I’m not sure what is going wrong in screen.seqs and why different parameters yield different results. Should I suspect my *.groups file? Is there something I am missing? Help much appreciated!!

You’re close… When you ran summary.seqs(fasta=july.unique.align), only 2.5% of your sequences started before 6214. So, it’s not surprising that you have so few sequences that got through. What you probably want is to run start=6414 or something more towards that end of the distribution.

Hope this helps,
Pat

Bummer, I was changing my first post when you replied. I see how I misunderstood “start” now. I’m going to try your suggestion and see what happens. Thanks.

Ok, I’m back in business. I’m still not sure if I understand why my minimum and maximum shows information below or above what I asked screen.seqs to do. I do have a working filter.seqs result now and have moved on to generating OTUs. Thank you.