Do I have low quality reads?


I have problems with the analysis of my V4 reads and I am wondering whether it is because of the quality of my reads. I am too inexperienced with Illumina to know what values I could or should expect.

I think I should start from the beginning and I apologize if I might ramble on too much.
I am using an Illumina MiSeq sequencer (v3 chemistry). I started out sequencing the V3V4 region, but I could not analyze the contigs because the distance matrix was too large (no surprise there, I know).
I used your blog entry to convince my boss to sequence only the V4 region.
(For the next run, I think I will try to convince him to use the v2 chemistry.)

I got those V4 reads back and here are the quality scores of one of the forward and reverse reads as an example of the quality:

My amplicon is about 250 bp long. The quality scores of the forward reads are good, mainly above 25, until position 175. The quality scores of the reverse reads are good only until position 100.
In my mind I should have perfectly good reads with maybe some ambiguous bases when I make contigs. However, one third of my reads contain Ns in their sequence.
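
As a rough sanity check on these numbers, the sketch below counts how many positions of a 250 bp amplicon are backed by a good-quality stretch from both reads, from only one read, or from neither. It assumes the forward read is good for its first 175 bases and the reverse read (which starts at the opposite end) for its first 100 bases; the function name and the simple cutoff model are illustrative assumptions, not how any particular contig assembler works.

```python
def good_quality_coverage(amplicon_len, fwd_good, rev_good):
    """Count amplicon positions covered by good-quality bases.

    fwd_good: number of leading forward-read bases with good quality.
    rev_good: number of leading reverse-read bases with good quality
              (the reverse read starts at the far end of the amplicon).
    """
    both = one = none = 0
    for pos in range(amplicon_len):
        in_fwd = pos < fwd_good                    # good forward stretch
        in_rev = pos >= amplicon_len - rev_good    # good reverse stretch
        covered = int(in_fwd) + int(in_rev)
        if covered == 2:
            both += 1
        elif covered == 1:
            one += 1
        else:
            none += 1
    return {"both": both, "one": one, "none": none}


coverage = good_quality_coverage(amplicon_len=250, fwd_good=175, rev_good=100)
print(coverage)  # {'both': 25, 'one': 225, 'none': 0}
```

Under these assumptions only about 25 positions have a good-quality call from both reads. Everywhere else one of the two reads contributes a low-quality base, so a disagreement between the reads is harder to resolve with confidence.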

I started off with 9 million reads, and removing all sequences with ambiguous bases left me with 6 million reads.
Out of those 5,978,622 sequences, 5,647,880 were unique. This already seemed like an awfully big number to me.
I went on with the analysis and I am now stuck at the chimera.uchime command, because it takes ages and it seems like the whole V3V4 analysis all over again.
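
To put the counts above in perspective, this small sketch computes the fraction of reads that survived the ambiguous-base filter, the fraction of the remaining reads that are unique, and the mean number of copies per unique sequence. The starting total of 9 million is the approximate figure quoted above; the variable names are only for illustration.

```python
total_reads = 9_000_000   # approximate starting count quoted above
screened = 5_978_622      # reads left after removing ambiguous bases
unique = 5_647_880        # unique sequences among the screened reads

fraction_kept = screened / total_reads
fraction_unique = unique / screened
mean_copies = screened / unique

print(f"kept after screening: {fraction_kept:.1%}")
print(f"unique among kept:    {fraction_unique:.1%}")
print(f"mean copies/unique:   {mean_copies:.3f}")
```

Roughly 94% of the screened reads are unique, so on average each unique sequence was observed only about once. That is the pattern expected when most reads carry at least one sequencing error.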

My (desperate) questions now:
Are so many reads with ambiguous bases and later on so many unique sequences normal?
Do the low quality reads (especially reverse reads) lead to such values?
Are the quality scores for the reverse reads an exception or do I have to expect that regularly when sequencing with Illumina MiSeq V3 chemistry?

My next step would be to start the analysis again only using the forward reads. Does that seem sensible or is there a better option?
Thank you in advance!


That blog post also discusses why the V3 chemistry is a disaster. One of the problems with the V3 chemistry is that if you have a 275 nt fragment and generate 250 nt per read, then that will inflate the error rate. For whatever reason, the error rate on the V3 chemistry really starts climbing around 500 cycles. This is why we are still encouraging people to use the V2 chemistry. Alternatively, you can get by with the V3 chemistry but dial it back to only generate 2x250 cycles.
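
To give a rough sense of how falling quality scores translate into errors, here is a minimal sketch based on the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10). The two quality profiles are hypothetical and chosen only for illustration; they are not taken from the run discussed here.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def expected_errors(quality_scores):
    """Expected number of erroneous bases in a read."""
    return sum(phred_to_error_prob(q) for q in quality_scores)


def prob_error_free(quality_scores):
    """Probability that every base in the read is correct."""
    p = 1.0
    for q in quality_scores:
        p *= 1.0 - phred_to_error_prob(q)
    return p


# Hypothetical 250-base reads: one stays at Q35 throughout,
# the other drops to Q20 after the first 100 bases.
good_read = [35] * 250
degraded_read = [35] * 100 + [20] * 150

for name, quals in [("good", good_read), ("degraded", degraded_read)]:
    print(name,
          f"expected errors = {expected_errors(quals):.2f},",
          f"P(error-free) = {prob_error_free(quals):.2f}")
```

In this toy comparison the degraded read is expected to carry about 1.5 errors and has only around a one-in-five chance of being error-free. Reads like that inflate the number of unique sequences.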

I’m reluctant to suggest people use only the forward read, since the error rate climbs on those too. The only thing I can suggest with these types of data is to classify the data and use the phylotype approach, or regenerate the data.
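
The idea behind the phylotype approach can be sketched as follows: once each read has a taxonomic classification, reads are binned by their lineage down to a chosen rank, so no pairwise distance matrix is needed. This is a toy illustration, not mothur's implementation; the read names, the lineages, and the `phylotype_counts` helper are all made up for the example.

```python
from collections import Counter


def phylotype_counts(taxonomy_by_read, level):
    """Bin reads by their lineage truncated to `level` ranks.

    taxonomy_by_read: dict mapping read name -> semicolon-separated lineage.
    level: number of ranks to keep (e.g. 6 for genus in a 6-rank lineage).
    """
    counts = Counter()
    for lineage in taxonomy_by_read.values():
        ranks = lineage.rstrip(";").split(";")
        counts[";".join(ranks[:level])] += 1
    return counts


# Hypothetical classifications for four reads.
reads = {
    "read1": "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia;",
    "read2": "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia;",
    "read3": "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;",
    "read4": "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Roseburia;",
}

genus_bins = phylotype_counts(reads, level=6)
family_bins = phylotype_counts(reads, level=5)
for lineage, n in family_bins.items():
    print(n, lineage)
```

Because each read is simply assigned to a bin, the work grows with the number of reads rather than with the number of read pairs. That makes this kind of binning tractable even when nearly every sequence is unique.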