Demultiplexing & undetermined reads (Kozich 2013)

Dear all,

We’ve been implementing the dual-index sequencing approach on a MiSeq (2 x 250 bp) as described in your Kozich 2013 paper, and it’s working pretty well for us. The only problem we seem to have is a huge number of undetermined reads. We get around 33% before sequence processing and still end up with 24% after preprocessing. So my first question is: has anyone else experienced this using this sequencing strategy?

Of course, I think I would be able to demultiplex from scratch using the index files and see if that makes any difference, but then I ran into a more theoretical question. The MiSeq demultiplexing software by default allows 1 mismatch in the barcode. We could allow more mismatches (if that’s possible with the barcodes) so that more sequences are assigned to a sample. But how trustworthy are sequences with mismatches in their barcode? I’d guess that mismatches in the barcodes are less likely to be sequencing errors and more likely to be errors from library preparation (PCR-level). And since we can find mismatches in the barcode, I think such mismatches also exist in the actual gene sequence. Mismatches in a 250 bp region (V4), where OTUs are built on a difference of only about 8 bp (97% OTU definition), seem completely wrong to me. Long story short, my second question: is it a good idea to demultiplex allowing 1 or 2 mismatches in the barcode, or is this best kept at 0 since the sequences are not trustworthy?
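To make the "if that’s possible with the barcodes" part concrete, here is a minimal Python sketch of mismatch-tolerant barcode assignment. It is only an illustration, not what the MiSeq software actually does: the sample names and 8-nt index sequences are made up, the helper names (hamming, max_safe_mismatches, assign) are hypothetical, and it handles a single index, so for a dual-index design the same check would apply to each index of the pair. The idea is that allowing m mismatches is only unambiguous if every pair of barcodes differs in at least 2m + 1 positions.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def max_safe_mismatches(barcodes) -> int:
    """Largest mismatch allowance that can never match two barcodes at once.

    If the closest pair of barcodes differs at d positions, a read can be
    within m mismatches of both only when 2*m >= d, so m = (d - 1) // 2.
    """
    min_dist = min(hamming(a, b) for a, b in combinations(barcodes, 2))
    return (min_dist - 1) // 2


def assign(read_index: str, barcodes: dict, max_mismatches: int):
    """Return the sample whose barcode matches, or None (undetermined)."""
    hits = [name for name, bc in barcodes.items()
            if hamming(read_index, bc) <= max_mismatches]
    # Zero hits or an ambiguous hit both count as undetermined.
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes, for illustration only.
barcodes = {
    "sampleA": "ATCGTACG",
    "sampleB": "ACTATCTG",
    "sampleC": "TAGCGAGT",
}

print(max_safe_mismatches(barcodes.values()))
print(assign("ATCGTACC", barcodes, max_mismatches=0))  # one base off sampleA
print(assign("ATCGTACC", barcodes, max_mismatches=1))
```

With a check like this, one could count how many undetermined reads would be rescued at 0, 1, or 2 mismatches before deciding whether a looser setting is worth it.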


I’m not sure what you mean by “undetermined”. You mean the ones that aren’t assigned to a pair of indices? If that’s the case then these tend to be the PhiX sequencing control (what % PhiX are you loading?). Also, if you have a less than stellar run, a larger number of reads will go in there because the reads are bad. I would not go above their defaults for deconvoluting the sequences to groups.


Indeed, that’s what I mean by undetermined reads. I hadn’t thought about the PhiX; we usually load 10%, and that might indeed explain some of the sequences in there. But after preprocessing, including aligning against the SILVA database and removing chloroplasts, eukaryotes, …, we still have around 24% of the preprocessed reads in that group. If it were PhiX, wouldn’t it be removed in these preprocessing steps, since it won’t map to SILVA?

I agree with not going above the defaults, but what’s your opinion on reducing it to absolutely no mismatches in the barcodes?

Thank you for the reply

I think going with no mismatches is probably overly stringent. The data might get scrapped because of a low-quality run. When we have accidentally overloaded the DNA on the chip, we have seen a similar problem.


Thank you Pat for your input!