All sequences removed with screen.seqs

Hello!
I am new to analysis and am using Mothur with paired-end MiSeq 16S V4 data for the first time. I was able to merge my forward and reverse reads which was summarized:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 294 294 0 3 1
2.5%-tile: 1 404 404 0 4 706035
25%-tile: 1 405 405 0 4 7060343
Median: 1 410 410 1 5 14120686
75%-tile: 1 424 424 3 5 21181028
97.5%-tile: 1 429 429 14 7 27535336
Maximum: 1 602 602 297 301 28241370
Mean: 1 415 415 2 4
# of Seqs: 28241370

I then attempted to remove ambiguous base calls and problematic sequences as recommended in the code where maxambig=0 and maxlength=275, but this eliminated all of my sequences:
It took 166 secs to screen 28241370 sequences, removed 28241370.

Could anyone possibly point out where I might be going wrong and/or how I can visualize the quality scores to determine the correct cutoff point for this data?
Thank you!
Lindsay

From what you’ve shown, it looks like your minimum sequence length was 294 sequences, which if you used maxlength=275 would remove all sequences. It looks like most of your sequences have length around 410, so would recommend changing your maxlength value to that.

Hello! Thank you so much for your response.
I set maxlength=410 and I still removed 61.38% of my sequences. I then did a trial of multiple higher max lengths ranging from 550-610 and it still removed 61.31% at a maxlength=610, which is higher than my maximum of 602. Any insight into how this could be would be much appreciated!
Thank you!
Lindsay

Hi Lindsay -

Are you sure you have V4 sequences? The V4 region is 250 nt long and even if your primers were still on the amplicon, I’d expect it to be smaller than 290 nt. Yours seem to be around 400-425 nt long. Also, it looks like you used the 2x300 chemistry since you have some 602 nt contigs, which would form if you had 2 301 nt reads stiched together. Perhaps you sequenced the V4 region with those long reads? If so, you likely sequenced off the end of the amplicon, which will increase the error rate and the number of Ns. You might try re-running make.contigs with trimoverlap=T to get rid of the region that extends beyond the end of the amplicon. For what it’s worth, I strongly discourage the use of the 2x300 chemistry since it results in pretty garbage data for amplicons. If you’re sequencing the V4 region, all you need is the 2x250 chemistry which will also save you some $

Pat

Hi Pat! Thanks so much for the response! From what I am told, the data is a 2 × 300 base pair paired-end protocol of the V4 region (515F, 806R). I appreciate the insight into appropriate chemistries based on the sequencing region. That definitely makes sense.

I did rerun the make.contigs with trimoverlap=T, which produced this summary:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1 1 0 1 1
2.5%-tile: 1 172 172 0 3 706009
25%-tile: 1 178 178 0 4 7060089
Median: 1 191 191 0 4 14120177
75%-tile: 1 197 197 1 5 21180265
97.5%-tile: 1 199 199 13 6 27534345
Maximum: 1 304 304 71 36 28240353
Mean: 1 186 186 1 4

of Seqs: 28240353

Then, I redid screen.seqs with maxambig=0 and maxlength=275, and only removed 43.85% of the sequences this time. This still seems like a large number removed. Do you think the chemistry used just resulted in a reduced sequence quality, thus resulting in increased removed sequences?
Thank you! Lindsay

Hi Lindsay,

Yeah, I worry that the sequencing quality wasn’t good enough. Try using the data from using ambig=0 and see where that gets you.

Pat

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.