I have just received my first Illumina data (I have previously analyzed 454 data). The problem is that the sequencing center went crazy and sent back more than 50 million reads (approx. 500 bp after the make.contigs command) of the V1-V3 region from 21 samples using 2x300 bp chemistry. Although I’m running mothur on a cluster with 24 CPUs and 256 GB, I can imagine that the downstream analysis may be a big problem, resulting in enormous files that cannot easily be handled. Does anyone have experience with such a large number of reads? One easy way around the problem, of course, is to reduce the number of reads before analyzing.
Have you read about the poor results with V1-V3 on 2x300 MiSeq? I’d set the stringency really high for the screen.seqs/filter.seqs steps; then, if you still have way too many, you could subsample down to some number per sample.
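The per-sample subsampling idea can be sketched in Python. This is only an illustration under assumptions: the function name and the `read_to_sample` mapping are hypothetical (the mapping could, for example, be parsed from a two-column read/sample file), and in practice mothur's own `sub.sample` command would normally do this job.

```python
import random
from collections import defaultdict


def subsample_per_sample(read_to_sample, n_per_sample, seed=0):
    """Keep at most n_per_sample read names for each sample.

    read_to_sample: dict mapping read name -> sample name (hypothetical input).
    Returns a dict mapping sample name -> sorted list of kept read names.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group read names by the sample they belong to.
    by_sample = defaultdict(list)
    for read, sample in read_to_sample.items():
        by_sample[sample].append(read)

    kept = {}
    for sample, reads in by_sample.items():
        reads.sort()  # make the result independent of input order
        if len(reads) > n_per_sample:
            reads = rng.sample(reads, n_per_sample)
        kept[sample] = sorted(reads)
    return kept
```

Samples that already have fewer reads than the target are kept whole here; whether to drop such samples instead is a separate decision.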
I’m confused about how you got 50M reads; that’s ~3 v3 MiSeq runs?
I haven’t read about the poor quality of V1-V3 with 2x300 MiSeq. However, we used GATC Biotech for the sequencing and they prefer this region, which is also the best region for oral cavity samples. I’m also a bit confused about the number of sequences, especially since the header of each sequence starts with “HISEQ_483…”, indicating that a HiSeq was used, which is quite strange considering that each read is 300 bp long. Unfortunately I haven’t talked to the sequencing center yet, as I just received the data this morning.
I just contacted my sequencing provider (GATC Biotech) about these issues. They say that, because of the problems Illumina has had with the V3 chemistry for the MiSeq, they have developed a protocol for running 300 bp PE on the HiSeq. Have any of you heard of this before, and if so, what is the error rate compared to the MiSeq?
So you have 2x150 HiSeq reads rather than 2x300 MiSeq? I have no idea why GATC does V1-V3 (regardless of whether it’s MiSeq or HiSeq); I think that’s the wrong approach for microbiome sequencing. How many samples do you have? You may want to strongly consider redoing them with just V4 and sequencing on a 2x250 MiSeq run.
I only have 21 samples, and I might consider resequencing them. However, I do not really follow you on the 2x150 bp, since each of the PE reads is actually 300 bp and the assembled sequence is around 500 bp, as expected using the 27F and 534R primers.
I thought the same. However, using the make.contigs command, the reads concatenated nicely, so somehow it seems that they have managed to do 2x300 bp PE on the HiSeq.
Look at your contigs. I bet they aren’t actually overlapping but are just stuck together with a couple of Ns (NNN) in the middle (or at least that’s how PANDAseq handles non-overlapping R1/R2).
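One quick way to test this is to count how many contigs contain a run of Ns somewhere in the middle. The following Python is a rough sketch, not a mothur feature: the function names and the minimum run length of 3 are assumptions chosen for illustration, and the input is any FASTA file of assembled contigs.

```python
import re


def has_internal_n_run(seq, min_run=3):
    """True if seq has a run of >= min_run Ns that touches neither end."""
    for m in re.finditer(r"N{%d,}" % min_run, seq.upper()):
        if m.start() > 0 and m.end() < len(seq):
            return True
    return False


def iter_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_padded_contigs(handle, min_run=3):
    """Return (contigs with an internal N run, total contigs)."""
    total = padded = 0
    for _, seq in iter_fasta(handle):
        total += 1
        padded += has_internal_n_run(seq, min_run)
    return padded, total
```

Calling `count_padded_contigs(open(path))` on the contigs FASTA gives the two counts; if almost every contig has an internal N run, the read pairs were probably joined rather than overlapped.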
That does not seem to be the case. I’ve just aligned them manually in MEGA and they overlap perfectly. Here are the first four reads; try it yourself. I’m very interested to hear if you get something different.
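The same check can be scripted instead of done by hand. The sketch below is a naive illustration, not how make.contigs works (real mergers also use quality scores): it looks for the longest suffix of R1 that matches a prefix of the reverse complement of R2. The function names and the default thresholds are assumptions. For two 300 bp reads spanning a roughly 500 bp amplicon, an overlap of about 300 + 300 - 500 = 100 bp would be expected.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_overlap(r1, r2, min_overlap=20, max_mismatch_frac=0.1):
    """Find the longest overlap between R1 and the reverse complement of R2.

    Returns (overlap_length, mismatches), or None if no overlap of at least
    min_overlap bases stays within the allowed mismatch fraction.
    """
    r1 = r1.upper()
    r2rc = reverse_complement(r2).upper()
    # Try the longest candidate overlap first and shrink from there.
    for length in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        tail = r1[-length:]
        head = r2rc[:length]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * length:
            return length, mismatches
    return None
```

A result of `None` for most pairs would support the idea that the reads do not really overlap; a consistent overlap near the expected length would support genuine 2x300 reads.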
You’re right, those are 2x300, which can’t be HiSeq (at least not by any published method I’ve seen), but the 50M seqs implies HiSeq. You should have a long chat with the seq facility to really figure out what’s going on.