Removing bases occurring prior to primers in demultiplexed fastq files

Hello,

I received some sequence data that had been demultiplexed, with the adapters and barcodes removed, by the core lab before we received it. The sequencing was run on an Illumina MiSeq using the V4 primers (515F/806R) described by Caporaso et al. What is confusing me is that there are a few bases (ranging from zero to at least 3) at the start of the F and R reads, before the primers. I do not have much experience with primer removal in this context and was hoping to ask for thoughts on the following:

  • Is this normal/expected, or should I expect to see the primers as the first 19 and 20 bases? Could this reflect incomplete removal of the adapter/barcode/spacer during demultiplexing and trimming?
  • Can I simply provide the primer sequences to the trim.seqs command, and will it also remove any leading bases? If so, would I need to run this on each file prior to make.contigs? Or can I use something like fastx trimmer to remove, say, the first 23 bases? Would that introduce artificial variation, since each read would end up with a different length (maybe only a problem, if it is a problem, for clustering/splitting algorithms that do not include a reference-based alignment step prior to clustering)?
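
To make the second question concrete, here is a minimal Python sketch of primer-anchored trimming, as opposed to cutting a fixed number of bases. It is only an illustration, not how trim.seqs works internally. The function names, the `max_offset=3` search window, and the `max_mismatches=2` threshold are all assumptions chosen for this example, and 515F is written with the degenerate base M (A or C) as in the published primer, whereas the post lists it with an A at that position.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

FWD_515F = "GTGCCAGCMGCCGCGGTAA"   # published 515F (M = A or C); assumption, see lead-in
REV_806R = "GGACTACHVGGGTWTCTAAT"  # 806R as given in the post


def count_mismatches(window, primer):
    """Count positions where the read base is not allowed by the primer code.

    An N in the read is counted as a mismatch.
    """
    return sum(base not in IUPAC[code] for base, code in zip(window, primer))


def trim_through_primer(read, primer, max_offset=3, max_mismatches=2):
    """Find the primer within the first few bases and drop everything through it.

    Tries every start position from 0 to max_offset, keeps the one with the
    fewest mismatches (ties go to the smallest offset), and returns the read
    with the leading bases and primer removed. Returns None if even the best
    position has more than max_mismatches mismatches.
    """
    best = None  # (offset, mismatches)
    for offset in range(max_offset + 1):
        window = read[offset:offset + len(primer)]
        if len(window) < len(primer):
            break  # read too short to contain the primer at this offset
        mismatches = count_mismatches(window, primer)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    if best is None or best[1] > max_mismatches:
        return None
    return read[best[0] + len(primer):]


# Starts of the three forward reads shown in this post (truncated for brevity).
forward_reads = [
    "GGTGCCAGCAGCCGCGGTAATACGTAGG",
    "GTGCCAGCCGCCGCGGTAATACGTAGG",
    "CGTGCCAGCCGCCGCGGTAATACGTAGG",
]
for read in forward_reads:
    print(trim_through_primer(read, FWD_515F))
```

With this approach all three forward reads start at the same amplicon position after trimming, whatever the number of leading bases. A fixed cut such as `read[23:]` would instead remove a different number of real amplicon bases from each read, which is the source of the artificial variation asked about above.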

Any thoughts on this matter would be greatly appreciated! I have included the first few lines of an example fasta file, with the primer sequences underlined in the first F and R reads to show what I am seeing. I believe I got the R primer correct given its degenerate bases.


Forward reads for example file (515F: GTGCCAGCAGCCGCGGTAA):

G[u]GTGCCAGCAGCCGCGGTAA[/u]TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGTGCGTAAAGAGAGCGTAGGCGGACTTTTAAGTGTGTTGTGAAATACTCGGCCTCAACTTCAGTGCTGCATTTCAAACTGGAAGTCTAGAGTGCAGAGGAGGAGAGTGGAATTCCTCGTGTAGCGGTGAAATGCGTGGTGATTAGGAAGAACACCAGTGGCGAGGGCGATTCTCTGGCCTGTAACTGCCGCTGAGGCTC

GTGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTTTTTGTCGTAAAGGGAGGGCAGGCGGTGTCTTTAGGTTGAAGTGAAAGCACCCGGCCCAACCGGGAAGGCTCCATGGCAACTGGGAGGTTTGGGTGCCGAAGAGGGGGGGGGGATTCCATGTGTAGCGGTGAAATGCGTAGATGTATGGGGGAACACCAGTGGCGAAGGCGGCTCTCTGGTGTGGCACTGGAGCTGAGGCTCG

CGTGCCAGCCGCCGCGGTAATACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGTAGGCGGGCTTTTAAGTGAGATGTGAAATACTCGGGCTCAACTTCAGTGCTGCATTGCAAACTGGAAGCTTAGGGTGCAGGAGAGGAGACTGGAATTCCTAGGTTAGCGGTGAAATGCGTAGTGATTAGGAAGAACACCAGTGGCGAAGGCGATTCTCTGCGCTGTAACTGCCGCTGGGGCTC


Reverse reads for example file (806R: GGACTACHVGGGTWTCTAAT):

TCT[u]GGATTTCGGGTGTATCTTAT[/u]CCTTNTTGCTCACCACGCTTTCGGTCCTCAGCGTCTGTTACAGACCAGAGAGCCGCCGTCGCCACTTGTGTTCTTCCTATTCTCTACGTCTTTCACCGCTCCACTAGGATTTCCATCCTCCTCTCCTGCACTCTAGTCTTCCGTTTTGACATGCATCGCTCCTGTTCAGCGCGGGTTTTCATCATCCTCCTTCAATGTCCGCCTCCGCCCTCTTTACCCTCATTAATC

TGCGGACTACGGGGTTTTCTAATCTTTNTTGCTCACCACTCTTTCGCGCCCCAGCGTCATTTAAAGACCAGAGACTCGCTTTCGCCACTGGTGTTCCTCCATATATTTACGCTTTTAACGGCTACACGAGGATTTCCACTCTCTTCTCCTGTACTTCATTCTACCGGTTTCCAAGGCCTCGCGCATGTGGAGCCCGAGGTTTTCACATCAGTCTTAAGAGACCTCCGTCGCTTTCTTTCCGCGCATTAATC

TCCGCAATACTCGTGTATCTTATCCTGNTCGCTCCCCACTCTTTCGTCCCTCAGCGTCAGTTCCAGCCCAGAGACTCGCCTTCGCCATGGGAGTTCTTCCTAATCTCTACGCATTTCACCGCTACACTGGGAATTCCACTCTCCTCTCCTGCACTCTAGTCTCCCTGTTTCACATGCACCGCTCGCGTTGAGCCCGTCTTTTTCACTTCTCTCTTCAAGCTCGCCCTTCGCCCTCTTAACCCCCATAAATC

TCAGGTCTACAGGGTTTTCATATCCTGNTTGCTCTCCACGCTTTCGACCCTAAGTGTCAGTTACAGCCCAGAGAGCCGCTTTCGTCACGGGTGTTCCTTCATCTATCTACGCATTTCACCGCTACACATGGATTTCCACTCCTCTCTTCTGCACTCAAGTCTCCCAGTTTCCAATGTCTCCCGCGTGTTGAGCCGGTGCCTTTCACCTCAGTCTCAAGTTACTGCCTGCGCCCTCTTCACGCACAAAAATT

ACTGGAATACCCGGGTATCATATCCTGNTTGCTCCCAACGCTTGCGATCCTCAGCGTCATTTACAGACCAGTGACCCGCTCTCGCCACTGGGGTTCCTCCATATATCTACGCATTTCACCGCTACACGTGGTATTCCACACTCCTCTTCTGTACTAAAGTCTCTCATTTTCCAAAGACTAGTCCCGGTTCAGCCGGGGTGTTTAACATCAGTCTCGAGAAACCCCCATCGTCTGCTTTGCGCACCTTCAAT

Ugh, what a pain - I’d ask them to stop doing that and to follow the Kozich protocol instead :)

I’m not sure that you’ve underlined the primers - I think that might include the barcodes as well. Regardless, I don’t see very good agreement between the expected primer and your sequences. You might double check that… As a possible solution, can you try creating new primers/barcodes in your oligos files that have N’s tacked on to the beginning?
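
As a rough illustration of the "N's tacked on" idea, the Python sketch below builds one paired-primer line for an oligos file. It assumes the tab-separated `primer <forward> <reverse>` line layout for paired reads; the helper names and the pad counts (1 leading base on the forward primer, 3 on the reverse, matching the first read pair in the post) are hypothetical. Whether the leading N's behave as wildcards in your mothur version is worth confirming before relying on this, and since the number of leading bases varies between reads, a single pad length may not fit every read.

```python
def pad_primer(primer, n_leading):
    """Return the primer with n_leading N's prepended."""
    return "N" * n_leading + primer


def oligos_primer_line(forward, reverse, fwd_pad, rev_pad):
    """Build a tab-separated paired-primer line (layout assumed, see lead-in)."""
    return "\t".join(
        ["primer", pad_primer(forward, fwd_pad), pad_primer(reverse, rev_pad)]
    )


# 515F is written with the degenerate base M here (an assumption; the post
# lists it with A at that position). 806R is as given in the post.
line = oligos_primer_line("GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT", 1, 3)
print(line)
```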

Pat