Hello everyone,
I am new to metagenomic bioinformatic analysis and want to study microbial diversity in soil samples. I have paired-end FASTQs sequenced with a kit that can target three regions (16S V3–V4, 16S V4–V5, and ITS). I want to split each sample into separate FASTQ pairs per region (V3–V4, V4–V5, ITS) and then analyze each accordingly. However, I'm unsure whether my data actually contain any non-V3–V4 amplicons. I contacted the lab, and they told me they don't know the exact primer sequences used for each region; instead, they performed the demultiplexing themselves and shared their results (on other samples) with me.
So I used these primers to check my FASTQ files, and I also ran an exploratory analysis of the overrepresented sequences in my data to see whether I could recover the primers myself.
What I observed was that:
1- FastQC "overrepresented sequences": the hits match the V3–V4 primers.
2- R script (ShortRead/Biostrings; searching up to the first 120 bp of each read, IUPAC-aware, ≤2 mismatches):
- R1 shows sizeable hits for the V3–V4 forward primer (F) (≈16–22% across samples).
- R2 shows sizeable hits for the V3–V4 reverse primer (R) (≈20%).
- R1 also shows non-trivial hits for the V4–V5 F primer (≈10–27%) and a small number for the ITS F primer (≈0.2–0.9%), but R2 does not show the matching reverse primer for either of those regions (0 reads).
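For reference, the matching logic in step 2 is roughly equivalent to the following. This is a minimal Python sketch for illustration only, not my actual R script: the function names are made up here, and the primer in the example is the commonly cited 341F V3–V4 forward primer, which is an assumption and may not be what this kit uses.

```python
# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(primer, window):
    """Count positions where the read base is not allowed by the primer code.

    An ambiguous base in the read (e.g. N) is counted as a mismatch.
    """
    return sum(1 for p, b in zip(primer, window) if b not in IUPAC[p])


def has_primer(read, primer, max_mismatch=2, search_len=120):
    """True if the primer matches anywhere in the first search_len bases."""
    region = read[:search_len].upper()
    n = len(primer)
    for start in range(len(region) - n + 1):
        if mismatches(primer, region[start:start + n]) <= max_mismatch:
            return True
    return False


def hit_fraction(reads, primer, **kwargs):
    """Fraction of reads that contain the primer (0.0 for an empty input)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(has_primer(r, primer, **kwargs) for r in reads) / len(reads)


# Hypothetical example: 341F (assumed V3-V4 forward primer) and two toy reads.
PRIMER_341F = "CCTACGGGNGGCWGCAG"
toy_reads = [
    "ACGTCCTACGGGAGGCAGCAGTTTT",  # contains the primer with 0 mismatches
    "GGAACGGGAGGCAGCAG",          # 3 mismatches at the only possible offset
]
fraction = hit_fraction(toy_reads, PRIMER_341F)
```

Running `hit_fraction` once per primer on R1 and again on R2 gives percentages of the kind listed above. Note that this only searches the primer in the orientation given, so a reverse primer has to be supplied as it would appear at the start of R2.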
Does this evidence conclusively indicate that only V3–V4 is present?
I would really appreciate your help. Thanks!