Several sequences missing primers and many ambiguous base calls

Hello,

I have been working with a set of environmental samples that we submitted to a sequencing facility for Illumina MiSeq sequencing. We sequenced the V4 region using 515F and 806R primers. Since receiving the dataset, we have run into a number of odd things in the dataset.

  1. The forward and reverse primers are only present in about 72% of sequences (searched for in BioStrings), which forces us to remove a large portion of the sequences

  2. When I work through the mothur workflow to screen sequences (with screen.seqs command), maxambig = 0 removes about half of my sequences even though the multiQC report does not indicate that the per base N content should be that high. The report does indicate that some of the R2 reads have about 5% sequences with ambiguous base calls at around 22 bp, right after the reverse primer from 1-20. However, generally, it seems like screen.seqs should not be removing half of all sequences.

> summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       246     246     0       2       1
2.5%-tile:      1       291     291     0       4       119929
25%-tile:       1       292     292     0       4       1199290
Median:         1       292     292     0       4       2398580
75%-tile:       1       292     292     1       6       3597870
97.5%-tile:     1       481     481     24      17      4677231
Maximum:        1       502     502     148     251     4797159
Mean:   1       316     316     2       5
# of unique seqs:       4797159
total # of seqs:        4797159

> screen.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table, maxambig=0)

Running command: remove.seqs(accnos=stability.trim.contigs.bad.accnos.temp, count=stability.contigs.count_table)
Removed 2192434 sequences from stability.contigs.count_table.

We confirmed with the sequencing facility that the forward and reverse primers should still be present, but we are unsure how to troubleshoot why only 72% of sequences have primers. Additionally, it is unclear at what step in the process caused several R2 reads to have this ambiguous base and why screen.seqs is removing so many sequences with maxambig=0. We are thinking it might have something to do with library prep, but I have not been able to find information on the internet that indicates what exactly causes this.

Thank you!

Hi there,

Your question…

1. The forward and reverse primers are only present in about 72% of sequences

It sounds like your sequencing was done from the illumina adapters out through the indices and primers, right? If ~25% of your sequences don’t have the primers, then I wonder what those sequences are that lack the primers. Can you go into the sequences and find the primers manually? If you blast the sequences, what do they come back as? I wonder if you have PhiX contamination - this is a control that is inserted into the libraries and removed later. Is it possible that the barcode sequence is confusiong BioStrings’ ability to detect the primer? Feel free to post a few sequences that are getting tossed that you think shouldn’t be along with the index and primer sequences.

2. When I work through the mothur workflow to screen sequences (with screen.seqs command), maxambig = 0 removes about half of my sequences

Your individual reads likely do not have Ns in them. It is only after you assemble the reads with make.contigs that the ambiguous base calls will appear. I would wonder what the metrics were on your sequencing. What % were passing filter? What was your %Q30? What version of the sequencing chemistry were you using? Is it possible they used 2x300 cycles to sequence the region? If this is the case, that would be an indicator of why the sequence quality is poor

Feel free to post answers to these questions and we can go from there…
Pat

Hi Pat -

It looks like the majority of the reads “without primers” do actually have primers in them but they are highly degenerate (> 20% error) which is still concerning. When I run cutadapt to filter out sequences that do not have either a forward or reverse primer (e = 0.2), it removes around 20% of sequences. If I change cutadapt to only a trim a sequence if neither the forward nor reverse primers are present, it removes only about 2-3%. I have included below a couple examples of sequences where neither the R1 nor R2 reads had primers that passed e = 0.2 in cutadapt.

I am not sure exactly what the sequencing facility uses other than they told me it was a “MiSeq 500bp kit run as a 2x251 +barcodes”. I have also attached the multiQC report quality plot (this report was generated before removing sequences without primers).

I am also interested in knowing whether this may be an issue caused by the sequencing facility, or if it is a problem on our end with library prep or something related if you have any thoughts on that.

Thank you so much!

  • Sample sequences:
    • R1: TTGTCAGCAGCCGCGTTGTCCCTGTCTCTTTTTTTCTTCTTCTTTCTCTCTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTCTT

    • R2: TTTCTACGCGGCTGCTGACACCNGTCTCTTATACACATCTGACGCTGCCGACGAATCATGCGGTGTAAAGCTCGGTGGCTGCCGTGTCTTTTATAATAAAATATCTTATTGTACATCAGATATTGTGGTGAACTATACACATTATTATGTGACTGCATCATTCGTTACATTAAGGAAGTAGATATCTCACAATCTCTTTTTGTCACCTTTATGTGGCGTCATCACCTGCTGCAGACACGGTCTAACTGTAT

    • R1: AGGGAAGCAGCCGGGAAAATCCCTGGCAGTAGCACAAATATGCGAACCGACAAAACGCAGACTTAGCGCGTGCGGATGAACCAGGTTAAAAAAAGAACTAACCTTGCGAGTCATTTCTTGGATTTGGACATTGGTAAAATACTGACCAGCCGTTTGAGCTGGAGTAAGAATTTGGCGCATAATCGAGGAAACCTGCTGTTGCTGGGAAAGATTGGTGTTTTCCATAATAGACGCAACGCGAGCAGTAGACT

    • R2: GAACGAACGAGGAGACAGAGACCGNCGAGATGAAAAAAGCGAACGCAAAAGAGAAAGCAGGCAGGGAGGAGGCGAGGGCACGCCAGAAAAGGAAAGACAAGGAATAAGAAAAAAAAGAGAAGCTGGCGTATCAACAGAAGGAGTCTAAGGCTCGAGTAGAGTCGATTAGGGAAAAAACCAAACTTTCAAAGCAACAGCAGGTTACCGAGATTAAGCGAAAAAGGAAGACACAAGATAAAAAGGAGGGGCAG

    • R1: GTGCCAGCCCCGAGTAGTCCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTCAGGCTTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAGAAAAAAAATAACATAGAAGATATTAGGCAGTTAATAAGCAAAAGAAAGCATACGATAGCGGTCTAAAGCGAATAGGCGTAGAGGTGTAACTATAAGCGAAGAAGAAACAAGAAGATTGTAGGATAGAGATATGACTACTAGCTATG

    • R2: GGACTACTCGGGGCTGGCACCTGTCTCTTATACACATCTGACGCTGCCGACGAATCATGCGGGGGAGATGTCTGGGGGCGGGGGGTCGTTTAAAAAAAAAAAAAACAAAAAAATAAATAAATGACTAGTAGTATTAACGTAATAGGAAGTGCAGTAAAAGAGTACAGAGATGAGAGCGAGAGTATGGGACAGGAAATCACAGCATGAGTAAACATAGGTGCGTCGAGTATGTACATTACACAGCAGTTCAG

Control sample sequences:

  • R1: GTGTCAGCGCCGTGCCGGAGAGGAGGCCCACCGTCCCGGCAACTCCCCCAATCAAAGCCGCCTCAGCCAGCATCACGCAGAGGATCTGACCGCGTGACGCCCCGAGTAGTCCCTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACGTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACAAAAGATAATCCACTCATAACATGACTTAGACTGACTTACAGAGATCGAGGCAATATCAGTGT

  • R2: GGACTACTCGGGGCGTCACGCGNTCAGATCCTCTGCGTGATGCTGGCTGAGGCGGCTTTGATTGGGGGAGTTGCCGGGACGGTGGGCCTCCTCTCCGGCACGGCGCTGACACCTGTCTCTTATACACATCTGACGCTGCCGACGACTTCCTTCGTGTGGAGTTCTGCTGGCGGCGGAACTTTTTAAAAAACAAAAGTAAGTCTGTGAGGTGCGAAGGGGGAATGTGGAGTGTGTGCAGTATTATGTTGCTG

  • R1: GGACTACACATCTCCGAGCCCACGAGACATGACGTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAGATAGAAATAATACAAAAGGGAACGGATTTAGAGTGAAGATTGAGAAAGTGGATGATAAGTTGATACATATATGGGTGTGTTGGTAGCGAGTTATTTAGGGTGCGGAATTGACTGAGTAAATATGATTATGACCCTACAAGTATCACGAAGGTGTGTGATACGTACTTGTAGATGATTGTAG

  • R2: GGAGTCGTTGTTTGTTGAGTGCGTGTTGTTAAATCAGTGTTCTTTCTATACTGTCATTGCACCGTTTGTGTCTGGCGGTATCAGTTGGTGTTCTACGTATTATATATTCTATTAGCGTGTCCACTAGATATTAGTGCAGCTTAAAGAAGTTATAAGACAAGCTGCTACAACAGAGTAGTTGATGTTTGGTGTATATGTTGTGTGTGATAGTGTGTGTACAGCATGTATGAGTGTGAAGCGTATCGAGTGAA

  • R1: GGACTACCCGGGTTTCTTATACACATCTCCGAGCCCACGAGACATGACGTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAACAGAACAAAAAGAAGAAATAAATAAAGTGAGTAAGAAATATAAAAAGTGGAGAAAAAGGAGCAGTAAGGCAAAAGGAGAATGACGAGGAGAGGACAATAGATTGTGAGATGAAGTAGATAATAGAAGAGAGAAGGACGATAGGAACAGTGAGGGACGA

  • R2: GTAGAAGTATATCACGAGCAGGGAAGAGTGGATATAAAAACAGAGGAGACCCCGGCTCTGATAGCACTCTTACGCTGCCGTCGACTCTCTACTGGTCGGTATACGGGTGTGGGGCATCGCCAAGTCAGACCAAGTAGGACTCACCAATAGCAAGTATGACGAGCAGGAATGCGGTAGACAGCAAGACGTTGAATACACAAAGAGTTGAGGAGTGTCCTAGTGTGTTTGTGGATGCACGCTACAGATGGTGA

Hey -

I never use cutadapt and strongly discourage its use. Use make.contigs instead. Sequences that are assembled after being quality trimmed are of poorer quality than leaving them untrimmed

That plot looks like a bad sequencing run. I’d ask what the percent passing filter was and how much PhiX they loaded. I suspect they ran it like a genome where you can get away with a lower percent PhiX and higher cluster density than you can wiht amplicons.

Pat

Hi Pat,

Okay, good to know on cutadapt. Also, the sequencing facility got back to me and they said “15% phix added and had a % pass filter of 83%”. Is that what it should look like?

Thank you!
Sophia

Hi Sophia,

Yeah those values seem typical. And you’re sure it was 2x250 and not 2x300, right?

Pat

The read length used for the sequencing run was 251bp+8bp+8bp+251bp (paired end)

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.