Mixed 16S/ITS kit but reads look overwhelmingly V3–V4 — how to verify and subset per region?

Hello everyone,

I am new to metagenomics bioinformatic analysis and want to study microbial diversity in soil samples. I have paired-end FASTQs sequenced with a kit that can target three regions (16S V3–V4, 16S V4–V5, and ITS). I want to split each sample into separate FASTQ pairs per region (V3–V4, V4–V5, ITS) and then analyze each accordingly, but I’m unsure whether my data actually contain non-V3–V4 amplicons. I contacted the lab, and they told me they don’t know the exact primer sequences used for each region. Instead, they performed the demultiplexing themselves and shared their results (on other samples) with me.

So I used these primers to check my FASTQ files, and I also ran an exploratory analysis of the overrepresented sequences in my data to see whether I could recover the primers myself.

Here is what I observed:

1- FastQC “overrepresented sequences”: these match the V3–V4 primers.

2- R script (ShortRead/Biostrings; first ≤120 bp, IUPAC, ≤2 mismatches):

  • R1 shows sizeable hits for V3–V4 F (≈16–22% across samples).

  • R2 shows sizeable hits for the V3–V4 reverse primer (R) (≈20%).

  • R1 also shows non-trivial hits for V4–V5 F (≈10–27%) and a small number for ITS F (≈0.2–0.9%), but R2 does not show the matching reverse primer for either of those regions (0 reads).

Does this evidence conclusively indicate that only V3–V4 is present?

I really appreciate your help. Thanks!

Hi Oty,

Ugh. What a pain for you. I’d strongly suggest you find a vendor that will give you the information you want as well as data from only one region at a time. This will be a mess. I strongly encourage people to use only the V4 region, for reasons explained here. Yes, that post is 10+ years old. Nothing has changed :)

It is hard to say whether you should expect the primers to be present without knowing how the sequencing was done. For example, with our Kozich approach you will not see the primers we used to sequence the reads. Other methods sequence the primers as part of every read.

If you’re finding primers, they should be at the beginning of the read, not embedded within it. For example, you will probably find the V4-V5 forward primer embedded in V3-V4 sequences since the two regions overlap.
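
To make the anchored-versus-embedded distinction concrete, here is a rough Python sketch. It is only an illustration: the two primer sequences are widely used 341F and 515F variants that I am assuming as placeholders (your kit may use different ones), and the function names, the 120 bp search window, the 2-mismatch limit, and the `anchor_slack` parameter are all hypothetical choices, not anything from mothur.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# ASSUMPTION: placeholder primers (common 341F and 515F variants).
# Replace these with whatever the kit actually uses.
V3F = "CCTACGGGNGGCWGCAG"
V4F = "GTGCCAGCMGCCGCGGTAA"


def mismatches(primer, window):
    """Count positions where the read base is not allowed by the primer code."""
    return sum(base not in IUPAC[code] for code, base in zip(primer, window))


def primer_position(read, primer, max_mismatch=2, search_len=120):
    """Return the 0-based offset of the first primer hit in the first
    search_len bases of the read, or None if there is no hit."""
    limit = min(len(read), search_len) - len(primer)
    for start in range(limit + 1):
        window = read[start:start + len(primer)]
        if mismatches(primer, window) <= max_mismatch:
            return start
    return None


def classify(read, primer, anchor_slack=0, **kwargs):
    """Label a read 'anchored' (primer at the very start, allowing
    anchor_slack extra bases), 'embedded' (primer further in), or 'absent'."""
    pos = primer_position(read.upper(), primer, **kwargs)
    if pos is None:
        return "absent"
    return "anchored" if pos <= anchor_slack else "embedded"


if __name__ == "__main__":
    # Toy V3-V4-like read: starts with a 341F match, 515F site sits further in.
    toy = "CCTACGGGAGGCAGCAG" + "T" * 30 + "GTGCCAGCAGCCGCGGTAA" + "T" * 30
    print("V3F:", classify(toy, V3F))
    print("V4F:", classify(toy, V4F))
```

If most of your "V4–V5 F" hits in R1 come back as embedded rather than anchored, they are probably just the overlapping site inside V3–V4 amplicons. If the library uses variable-length spacers before the primer, a small `anchor_slack` would be needed.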

My suggestion would be to take your forward reads and align them to a reference using mothur’s align.seqs() function and then use summary.seqs() to see where those sequences start in the alignment. They should start at three general locations corresponding to where the V3F, V4F, and ITSF primers align. You could also repeat this with the reverse reads although you’ll likely need to use flip = T to align the reverse complement of the R2 reads.
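
summary.seqs already reports the start and end of each aligned read, so nothing more is strictly needed. If you want to tally the start positions yourself, a minimal Python sketch follows. It assumes the aligned FASTA pads the ends with `.` and marks gaps with `-`; the function names and the bin width are hypothetical.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def start_column(aligned):
    """Return the 1-based alignment column of the first real base,
    treating '.' and '-' as padding/gaps (ASSUMPTION about the file format)."""
    for col, char in enumerate(aligned, start=1):
        if char not in ".-":
            return col
    return None  # sequence is all gaps, i.e. it did not align


def tally_starts(aligned_seqs, bin_width=100):
    """Count reads per start-position bin; keys are the first column of each bin."""
    counts = Counter()
    for seq in aligned_seqs:
        col = start_column(seq)
        if col is not None:
            counts[(col - 1) // bin_width * bin_width + 1] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Toy alignment: two reads start at column 5, one at column 11.
    toy = [
        ">r1", "....ACGT--ACG...",
        ">r2", "....ACGTTTACG...",
        ">r3", "..........ACGT..",
    ]
    seqs = [seq for _, seq in parse_fasta(toy)]
    for bin_start, n in tally_starts(seqs, bin_width=10).items():
        print(f"start column {bin_start}+: {n} reads")
```

A few tight clusters of start columns would suggest distinct forward primers; a single cluster would suggest only one region is really there.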

Hope this is helpful,
Pat