I have been working with a set of environmental samples that we submitted to a sequencing facility for Illumina MiSeq sequencing. We sequenced the V4 region using 515F and 806R primers. Since receiving the dataset, we have run into a number of odd things in the dataset.
The forward and reverse primers are only present in about 72% of sequences (searched for in BioStrings), which forces us to remove a large portion of the sequences
When I work through the mothur workflow to screen sequences (with screen.seqs command), maxambig = 0 removes about half of my sequences even though the multiQC report does not indicate that the per base N content should be that high. The report does indicate that some of the R2 reads have about 5% sequences with ambiguous base calls at around 22 bp, right after the reverse primer from 1-20. However, generally, it seems like screen.seqs should not be removing half of all sequences.
We confirmed with the sequencing facility that the forward and reverse primers should still be present, but we are unsure how to troubleshoot why only 72% of sequences have primers. Additionally, it is unclear at what step in the process caused several R2 reads to have this ambiguous base and why screen.seqs is removing so many sequences with maxambig=0. We are thinking it might have something to do with library prep, but I have not been able to find information on the internet that indicates what exactly causes this.
1. The forward and reverse primers are only present in about 72% of sequences
It sounds like your sequencing was done from the illumina adapters out through the indices and primers, right? If ~25% of your sequences don’t have the primers, then I wonder what those sequences are that lack the primers. Can you go into the sequences and find the primers manually? If you blast the sequences, what do they come back as? I wonder if you have PhiX contamination - this is a control that is inserted into the libraries and removed later. Is it possible that the barcode sequence is confusiong BioStrings’ ability to detect the primer? Feel free to post a few sequences that are getting tossed that you think shouldn’t be along with the index and primer sequences.
2. When I work through the mothur workflow to screen sequences (with screen.seqs command), maxambig = 0 removes about half of my sequences
Your individual reads likely do not have Ns in them. It is only after you assemble the reads with make.contigs that the ambiguous base calls will appear. I would wonder what the metrics were on your sequencing. What % were passing filter? What was your %Q30? What version of the sequencing chemistry were you using? Is it possible they used 2x300 cycles to sequence the region? If this is the case, that would be an indicator of why the sequence quality is poor
Feel free to post answers to these questions and we can go from there…
Pat
It looks like the majority of the reads “without primers” do actually have primers in them but they are highly degenerate (> 20% error) which is still concerning. When I run cutadapt to filter out sequences that do not have either a forward or reverse primer (e = 0.2), it removes around 20% of sequences. If I change cutadapt to only a trim a sequence if neither the forward nor reverse primers are present, it removes only about 2-3%. I have included below a couple examples of sequences where neither the R1 nor R2 reads had primers that passed e = 0.2 in cutadapt.
I am not sure exactly what the sequencing facility uses other than they told me it was a “MiSeq 500bp kit run as a 2x251 +barcodes”. I have also attached the multiQC report quality plot (this report was generated before removing sequences without primers).
I am also interested in knowing whether this may be an issue caused by the sequencing facility, or if it is a problem on our end with library prep or something related if you have any thoughts on that.
I never use cutadapt and strongly discourage its use. Use make.contigs instead. Sequences that are assembled after being quality trimmed are of poorer quality than leaving them untrimmed
That plot looks like a bad sequencing run. I’d ask what the percent passing filter was and how much PhiX they loaded. I suspect they ran it like a genome where you can get away with a lower percent PhiX and higher cluster density than you can wiht amplicons.
Okay, good to know on cutadapt. Also, the sequencing facility got back to me and they said “15% phix added and had a % pass filter of 83%”. Is that what it should look like?