I was following the Costello analysis.
I made the alignment, align.seqs(fasta=july.unique.fasta, reference=silva.bacteria.fasta, ksize=8, flip=T)
Here are the summary.seqs(fasta=july.unique.align) results:
Start End NBases Ambigs Polymer
Minimum: 1044 1044 1 0 1
2.5%-tile: 6214 13125 104 0 3
25%-tile: 6333 13125 143 0 5
Median: 6333 13125 156 0 5
75%-tile: 6414 13125 159 0 5
97.5%-tile: 7928 13125 181 0 7
Maximum: 43105 43116 461 0 31
of Seqs: 61990
I wanted to exclude anything that doesn’t start by 6214 and end by 13125. I used the following to accomplish this
screen.seqs(fasta=july.unique.align, name=july.names, group=july.groups, start=6214, end=13125, maxhomop=8)
Here are the summary.seqs(fasta=july.unique.good.align) results:
Start End NBases Ambigs Polymer
Minimum: 1115 13125 168 0 4
2.5%-tile: 5441 13125 172 0 4
25%-tile: 6107 13125 183 0 5
Median: 6107 13125 185 0 5
75%-tile: 6200 13125 193 0 5
97.5%-tile: 6214 13125 221 0 6
Maximum: 6214 13879 461 0 7
of Seqs: 941
The majority of my sequences wound up in the “bad” file and only 941 survived this screen, which is not what I expected based on the summary of the input file for screen.seqs and where I chose to set start/end. Also, of the ones that survived the start and end positions aren’t consistent with my input parameters of 6214 and 13125 respectively.
I repeated the final summary.seqs and added in the names file in case that was the problem:
mothur > summary.seqs(fasta=july.unique.good.align, name=july.good.names)
Start End NBases Ambigs Polymer
Minimum: 1115 13125 168 0 4
2.5%-tile: 6102 13125 173 0 4
25%-tile: 6107 13125 184 0 5
Median: 6107 13125 185 0 5
75%-tile: 6200 13125 197 0 5
97.5%-tile: 6212 13125 209 0 6
Maximum: 6214 13879 461 0 7
of unique seqs: 941
total # of seqs: 1685
My output is slighty different, but the error in the filter seems to persist. I also tried filtering by length instead of start/end
screen.seqs(fasta=july.unique.align, name=july.names, group=july.groups, minlength=104, maxlength=181, maxhomop=8)
Here are those results:
Start End NBases Ambigs Polymer
Minimum: 1044 2521 104 0 2
2.5%-tile: 6333 13125 122 0 4
25%-tile: 6333 13125 156 0 5
Median: 6333 13125 156 0 5
75%-tile: 6333 13125 159 0 5
97.5%-tile: 7694 13125 177 0 6
Maximum: 40797 41783 181 0 8
of unique seqs: 59407
total # of seqs: 488827
In this case the screen settings actually worked. I followed this up with
filter.seqs(fasta=july.unique.good.align, vertical=T, trump=.)
and got this result:
Start End NBases Ambigs Polymer
Minimum: 0 0 0 0 1
2.5%-tile: -1 -1 0 0 1
25%-tile: -1 -1 0 0 1
Median: -1 -1 0 0 1
75%-tile: -1 -1 0 0 1
97.5%-tile: -1 -1 0 0 1
Maximum: -1 -1 0 0 1
of Seqs: 59407
I’m guessing this is bad…really bad… I’m not sure what is going wrong in screen.seqs and why different parameters yield different results. Should I suspect my *.groups file? Is there something I am missing? Help much appreciated!!