Screen.seqs is removing most of my sequences

I am having difficulties figuring out why screen.seqs is completely removing >99% of my reads. As per the SOP, you run screen.seqs to get sequences that start at or before/after the start/end positions. However, instead of keeping these sequences, it is completely eliminating them.

I am using Mothur 1.44.3

Sequence after aligning to the reference database:

summary.seqs(fastacurrent, count=current)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	7	10	1	0	1	1
2.5%-tile:	987	1665	121	0	4	75423
25%-tile:	1006	1668	122	0	4	754226
Median: 	1007	1668	122	0	4	1508451
75%-tile:	1007	1672	123	0	4	2262676
97.5%-tile:	1028	1684	128	0	5	2941478
Maximum:	1718	1718	132	0	8	3016900
Mean:	1005	1669	122	0	4
# of unique seqs:	47954
total # of seqs:	3016900

screen.seqs(fasta=current, count=current, summary=current, start=1006, end=1684, maxhomop=8)
filter.seqs(fasta=current, vertical=T)
unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current, count=current)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	219	118	0	3	1
2.5%-tile:	17	219	123	0	3	558
25%-tile:	17	219	123	0	3	5579
Median: 	17	219	123	0	3	11158
75%-tile:	18	219	131	0	4	16736
97.5%-tile:	18	219	131	0	5	21757
Maximum:	26	238	132	0	7	22314
Mean:	17	219	126	0	3
# of unique seqs:	1456
total # of seqs:	22314

So as you can see it removed almost 3 million sequences.

I have opened the .align file after the alignment to the reference database and the region between 987 - 1006 is a bit gappy for the majority of the sequences, but the sequences are good after that so I do not want to get rid of them. I would like to trim the database so all sequences start and end at the designated positions. I have tried running pcr.seqs and I get the exact same results. I have also tried to run scree.seqs with different start positions (e.g, 987, 1007) but that too deletes most of my database. I’ve looked into trim.seqs but that appears to only be used for trimming off adaptors/primers etc.

I am lost as to how to move forward without losing these sequences. Any insight would be greatly appreciated.

The start column is the position at which sequences start. So picking 1006 will require your sequences to start at or before 1006. This would toss most of your sequences. Also, the end column is the position at which sequences end. So picking 1684 would require your sequences to end at or after 1684, again tossing most of your sequences. I’d suggest using start=1028 and end=1665. Give that a go and let us know how the output looks.

Pat

Hi Pat,

Thanks for the response and suggestion. It looks like the start and end poisitons that you suggested worked well, as I only lost about 90K sequences and not almost 3 million.

Thanks!

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 7 10 1 0 1 1
2.5%-tile: 987 1665 121 0 4 75423
25%-tile: 1006 1668 122 0 4 754226
Median: 1007 1668 122 0 4 1508451
75%-tile: 1007 1672 123 0 4 2262676
97.5%-tile: 1028 1684 128 0 5 2941478
Maximum: 1718 1718 132 0 8 3016900
Mean: 1005 1669 122 0 4

of unique seqs: 47954

total # of seqs: 3016900

mothur > screen.seqs(fasta=current, count=current, summary=current, start=1028, end=1665, maxhomop=8)

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 301 110 0 2 1
2.5%-tile: 21 301 121 0 4 73275
25%-tile: 26 303 122 0 4 732744
Median: 27 303 122 0 4 1465488
75%-tile: 27 304 123 0 4 2198231
97.5%-tile: 45 314 128 0 5 2857700
Maximum: 45 333 132 0 8 2930974
Mean: 26 304 122 0 4

of unique seqs: 42228

total # of seqs: 2930974

1 Like

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.