Adapting the MiSeq SOP for MiSeq V3 kit data

Dear all,

I am trying to run the mothur MiSeq SOP on Illumina MiSeq V3 data (300 bp paired-end reads, 2*300 bp, http://www.illumina.com/products/miseq-reagent-kit-v3.ilmn) that I got back from an external sequencing center (LGC Genomics GmbH). The data are 16S amplicons generated with primers 341F-785R, and I appear to be running into some differences from the current SOP on which I would like to get some input.

For starters, but not entirely unrelated, I got the data in several different forms from their on-site preprocessing:

  1. AdapterClipped: folder where the Illumina TruSeq adapters were clipped and reads <100 bases were discarded
  2. RAW: reads sorted by inline barcodes, where no mismatches were allowed and the barcode was clipped after sorting (reads with missing or one-sided barcodes or conflicting barcode pairs were discarded)
  3. PrimerClipped: the primers were detected and removed (2 mismatches allowed, both primers of a pair had to be present) and the sequences were turned into Fwd-Rev primer orientation
  4. Combined: Fwd and Rev reads combined with FLASh (http://ccb.jhu.edu/software/FLASH/; minimum overlap 10 bases, maximum mismatch rate 25%)

I chose to work with folder 3 (the “PrimerClipped” data), as I feel it is the closest to the input files we take for the current SOP. I would actually like to work with the “Combined” sequences, but I’m not really sure how to plug that data into the current SOP ( :?: I suppose I would need to use merge.files, but how would a groups file be generated then? :?: )

I am starting with 1,606,174 sequences, with lengths ranging from 290 to 548 bases. Here I ran into my first question: how should I set maxlength now? As most of my sequences fall between 403 and 427 bases, I chose to set maxlength to 445 in the first screen.seqs, and of course the maximum number of ambiguities was set to 1. This already reduces my total to 1,030,909 sequences, i.e. roughly one third (35%) of the dataset is removed, whereas in the SOP only about 15% of the data gets removed at this step. I was wondering whether I chose my maxlength wrong? Does anybody have experience with choosing this parameter, or is such a big loss normal for MiSeq V3 data?
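For clarity, this is roughly the screen.seqs call I mean (a sketch; the fasta and group file names are just SOP-style placeholders, not my actual files):

```
summary.seqs(fasta=stability.trim.contigs.fasta, processors=12)
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=1, maxlength=445)
```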

Next, for pcr.seqs I was very foggy on how to determine the exact positions at which my sequences start and end in the silva database, so I skipped this step because, as I understand it, its main purpose is to increase computational efficiency, and computing power is not really an issue on my part.
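For completeness, this is roughly what the skipped customization step would look like; the coordinates are only a guess based on where my reads ended up aligning (next paragraph), and the reference file name is the SOP default:

```
# trim the silva reference down to the region covered by the 341F-785R amplicon
pcr.seqs(fasta=silva.bacteria.fasta, start=6428, end=23440, keepdots=F, processors=12)
```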

Hence I aligned against the full silva.bacteria.fasta set and found that most of the sequences aligned between positions 6428 and 23440, after which I again screened the sequences with maxhomop=8 and the start and end set to the results of the alignment. This left only 840,151 sequences (an 18% reduction of the previous data, or 48% from the original set, as opposed to only a 15% reduction from the original set at this point in the SOP).
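The alignment and screening I did correspond roughly to the following (again a sketch with placeholder file names):

```
align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.bacteria.fasta, processors=12)
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, summary=current, start=6428, end=23440, maxhomop=8)
```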

For the pre.cluster command I used diffs=4 instead of 2, as the rule of thumb is 1 diff per 100 bases and my sequences are at least 400 bases long.
Furthermore, after chimera.uchime the total number of sequences is further reduced to 803,074, which is quite acceptable (about 5% chimeras overall), but it still means that I lose 50% of the data.
Finally, after removing unwanted taxa I end up with 802,104 sequences. The commands for these steps are sketched below.
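For reference, these steps correspond roughly to the following commands (a sketch; the reference and taxonomy files are simply the ones the SOP uses, and “current” stands in for my actual file names):

```
pre.cluster(fasta=current, count=current, diffs=4)
chimera.uchime(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
classify.seqs(fasta=current, count=current, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
```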

Overall, what worries me most is that I lose over 50% of my data and don’t really know how to “tweak” the parameters to mitigate this loss.

Any input on this matter would be warmly welcome :slight_smile:

Thanks in advance,

FM Kerckhof

Hi there,

We’ve had very limited success in getting the v3 kit to work as well as the v2 kit for 16S. A critical component of what we do is the requirement that the reads fully overlap and that each mismatch is resolved by looking at the difference in quality scores between the two bases. You are probably losing data because of this. If you don’t follow these suggestions you will get a lot of crappy data, so you can do all sorts of tweaking, but you’re just going to be letting in more bad data. I’d encourage you to take a look at the data in the Kozich et al. AEM 2013 paper that describes our approach.
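For reference, the overlap-based assembly this refers to is the make.contigs step at the start of the SOP (a sketch; the file name is the SOP placeholder):

```
# assemble each read pair into a contig; where the reads disagree within the
# overlap, the base with the clearly better quality score is kept, otherwise
# the position is treated as ambiguous
make.contigs(file=stability.files, processors=8)
```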

What we’ve found to work with the V3 kit is to reduce the cluster density. Of course this ends up giving a similar number of reads to the V2 kit, at an extra charge.

Pat

Pat,

Thanks for your input. I’ll take a good look at the paper.
The choice of kit was not really mine but the sequencing facility’s, so there’s not much I can do about that :frowning:

To address the issue of good overlaps: I think it might be a good idea to use the already combined datasets. But how would I do this in mothur? The annoying thing, of course, is that they are already demultiplexed and the primers and barcodes have been clipped off :?

Kind regards,

FM

Hi there,

With the help of a colleague, I figured out how to use the pre-combined sequences I got from the sequencing company
(LGC Genomics GmbH combined the reads with FLASh 1.2.4, minimum overlap 10 bases, maximum mismatch rate 25%) in mothur. Basically, by using make.group with a group for each sample, followed by merge.files, I can pick up the MiSeq SOP at the screen.seqs() command, and I am able to do all the steps and still have 85% of the data left.
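Concretely, the sequence of commands looks roughly like this (a sketch with made-up sample names; the exact output file names may differ, so check what make.group actually writes):

```
# assign every read in each per-sample fasta to its sample
make.group(fasta=sampleA.fasta-sampleB.fasta-sampleC.fasta, groups=sampleA-sampleB-sampleC)
# concatenate the per-sample fastas into a single file
merge.files(input=sampleA.fasta-sampleB.fasta-sampleC.fasta, output=all_samples.fasta)
# pick up the SOP at screen.seqs with the merged fasta and the new group file
screen.seqs(fasta=all_samples.fasta, group=current, maxambig=1, maxlength=445)
```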

However, I run into trouble at the stage where I have to bin my OTUs: after 6 hours of running on 12 cores, dist.seqs creates a 70.6 GB distance matrix, even with a cutoff of only 0.10 (!). The next step is, of course, the clustering, and I would like some input on that: would it be best here to use the cluster command with a pre-defined cutoff of 0.03, or cluster.split with large set to true?
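To make the two options concrete (a sketch with placeholder file names):

```
# option 1: compute the full distance matrix and cluster at a fixed cutoff
dist.seqs(fasta=final.fasta, cutoff=0.10, processors=12)
cluster(column=final.dist, count=final.count_table, cutoff=0.03)

# option 2: split the dataset by taxonomy and cluster each piece separately
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03, large=t)
```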

Edit: Using cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.03, hard=t, large=t), I was able to bin my OTUs, only I got this weird message: “Cutoff was 0.035 changed cutoff to 0.02”. I understand this is something of a FAQ and that the mothur cluster command does not report distances at which the clustering does not change, but does this mean that my OTUs are actually binned at 98% similarity? Given the read length, what are the phylogenetic implications of that (is an OTU still more or less equivalent to a species, or not anymore)? Given that I had set hard to true, I am surprised to see this happening though…


Thanks in advance for your reply.

FM

The long runtimes and headaches are happening because of the elevated sequencing error rate induced by the V3 chemistry. It seems you are stuck with 0.02 OTUs and should expect more of those than you would get at 0.03.

Pat

Thanks for your input Pat,

I know it is not ideal and I would have loved to do v2 Illumina, but basically the data is generated and paid for and it’s my job to make sense of it now.
So I’m doing the best I can given the tools I have. But indeed, at 0.02 I get about 103,929 OTUs, which is just not workable for any downstream analysis.
I am guessing I will have to find alternative methods for OTU binning.


Kind regards,
FM

I suspect the powers that be would actually save money by resequencing the samples versus having you make sense of highly error-prone data. Although that assumes they value your time and the cost of hardware and CPU time :slight_smile:

:mrgreen: For them an analyst is just somebody who makes nice plots that should’ve been done yesterday… Thanks for your take on things anyway, I’ll spread the word.
I got word back from LGC, by the way. Apparently the early 2014 MiSeq v3 kits were low quality, but that seems to be resolved by now, and they regularly use mothur for OTU picking themselves, although it is not entirely clear to me yet how they do it. I am having a very nice bidirectional conversation with them on how to go forward… I hope it will.

Kind regards and thank you for your time!

I feel your pain. You inspired me to write this today…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

I’m not so sure that the chemistry has gotten better; this plot from V3 data is from a few weeks ago.

Marvelous post Pat! Thanks. I hope it will change some things around here and for other labs.

Hi Pat,

Amazingly, I am still spending my time on this analysis and my PI hasn’t killed me yet :mrgreen:. I tried mothur with several other settings (phylotype, …) and some spurious-OTU filtering in R afterwards, but got horrible rarefaction curves. I also tried UPARSE to get my OTU table, but the UPARSE quality filtering kicked out about 90% of the data, even at the highest noise settings.

Hence I was just wondering about what you wrote in your blog: “Resequence the data and do it right. If you can’t find someone to do it right, let me know and I can hook you up with people that can.” Could you hook me up with some people who can, because all communication with LGC is dead right now ( :cry: ) and I am really not going to keep wasting my time on this. Even if I can’t convince my PI and have to pay out of my own pocket, I will get this done right.

Thanks in advance for a swift reply :slight_smile: .

Shoot me an email and I can get you in touch with the person you need to talk with.

Pat