I have been working with a set of environmental samples that we submitted to a sequencing facility for Illumina MiSeq sequencing. We sequenced the V4 region using 515F and 806R primers. Since receiving the dataset, we have run into a number of odd things in the dataset.
The forward and reverse primers are only present in about 72% of sequences (searched for in BioStrings), which forces us to remove a large portion of the sequences
When I work through the mothur workflow to screen sequences (with screen.seqs command), maxambig = 0 removes about half of my sequences even though the multiQC report does not indicate that the per base N content should be that high. The report does indicate that some of the R2 reads have about 5% sequences with ambiguous base calls at around 22 bp, right after the reverse primer from 1-20. However, generally, it seems like screen.seqs should not be removing half of all sequences.
> summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 246 246 0 2 1
2.5%-tile: 1 291 291 0 4 119929
25%-tile: 1 292 292 0 4 1199290
Median: 1 292 292 0 4 2398580
75%-tile: 1 292 292 1 6 3597870
97.5%-tile: 1 481 481 24 17 4677231
Maximum: 1 502 502 148 251 4797159
Mean: 1 316 316 2 5
# of unique seqs: 4797159
total # of seqs: 4797159
> screen.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table, maxambig=0)
Running command: remove.seqs(accnos=stability.trim.contigs.bad.accnos.temp, count=stability.contigs.count_table)
Removed 2192434 sequences from stability.contigs.count_table.
We confirmed with the sequencing facility that the forward and reverse primers should still be present, but we are unsure how to troubleshoot why only 72% of sequences have primers. Additionally, it is unclear at what step in the process caused several R2 reads to have this ambiguous base and why screen.seqs is removing so many sequences with maxambig=0. We are thinking it might have something to do with library prep, but I have not been able to find information on the internet that indicates what exactly causes this.
1. The forward and reverse primers are only present in about 72% of sequences
It sounds like your sequencing was done from the illumina adapters out through the indices and primers, right? If ~25% of your sequences don’t have the primers, then I wonder what those sequences are that lack the primers. Can you go into the sequences and find the primers manually? If you blast the sequences, what do they come back as? I wonder if you have PhiX contamination - this is a control that is inserted into the libraries and removed later. Is it possible that the barcode sequence is confusiong BioStrings’ ability to detect the primer? Feel free to post a few sequences that are getting tossed that you think shouldn’t be along with the index and primer sequences.
2. When I work through the mothur workflow to screen sequences (with screen.seqs command), maxambig = 0 removes about half of my sequences
Your individual reads likely do not have Ns in them. It is only after you assemble the reads with
make.contigs that the ambiguous base calls will appear. I would wonder what the metrics were on your sequencing. What % were passing filter? What was your %Q30? What version of the sequencing chemistry were you using? Is it possible they used 2x300 cycles to sequence the region? If this is the case, that would be an indicator of why the sequence quality is poor
Feel free to post answers to these questions and we can go from there…
Hi Pat -
It looks like the majority of the reads “without primers” do actually have primers in them but they are highly degenerate (> 20% error) which is still concerning. When I run cutadapt to filter out sequences that do not have either a forward or reverse primer (e = 0.2), it removes around 20% of sequences. If I change cutadapt to only a trim a sequence if neither the forward nor reverse primers are present, it removes only about 2-3%. I have included below a couple examples of sequences where neither the R1 nor R2 reads had primers that passed e = 0.2 in cutadapt.
I am not sure exactly what the sequencing facility uses other than they told me it was a “MiSeq 500bp kit run as a 2x251 +barcodes”. I have also attached the multiQC report quality plot (this report was generated before removing sequences without primers).
I am also interested in knowing whether this may be an issue caused by the sequencing facility, or if it is a problem on our end with library prep or something related if you have any thoughts on that.
Thank you so much!
- Sample sequences:
Control sample sequences:
I never use cutadapt and strongly discourage its use. Use
make.contigs instead. Sequences that are assembled after being quality trimmed are of poorer quality than leaving them untrimmed
That plot looks like a bad sequencing run. I’d ask what the percent passing filter was and how much PhiX they loaded. I suspect they ran it like a genome where you can get away with a lower percent PhiX and higher cluster density than you can wiht amplicons.
Okay, good to know on cutadapt. Also, the sequencing facility got back to me and they said “15% phix added and had a % pass filter of 83%”. Is that what it should look like?
Yeah those values seem typical. And you’re sure it was 2x250 and not 2x300, right?
The read length used for the sequencing run was 251bp+8bp+8bp+251bp (paired end)
This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.