Dear all,
I am trying to run the Illumina MiSeq SOP on Illumina MiSeq V3 data I got back from an external sequencing center (LGC Genomics GmbH) with 300 bp paired-end reads (2*300 bp, http://www.illumina.com/products/miseq-reagent-kit-v3.ilmn) on 16S with primers 341F-785R and I appear to be running into some differences with the current SOP where I would like to get some input on.
For starters, but not entirely unrelated, I got the data in several different forms from their on-site preprocessing:
- AdapterClipped: folder where the Illumina TruSeq adapters were clipped and reads <100 bases were discarded
- RAW: reads sorted by inline barcodes, where no mismatches were allowed and the barcode was clipped after sorting (reads with missing or one-sided barcodes or conflicting barcode pairs were discarded)
- PrimerClipped: the primers were detected and removed (2 mismatches were allowed, pairs had to be present), seqences were turned in to Fwd-Rev primer orientation
- Combined: combination of Fwd and Rev reads with FLASh http://ccb.jhu.edu/software/FLASH/ (minimum overlap 10 bases, maximum mismatch rate:25%)
I chose to work with folder 3 (the “PrimerClipped” data), as I feel this one is closest related to the input files we take for the current SOP. I would like to work with the ‘combined’ sequences but I’m not really sure how to plug this data into the current SOP ( :?: I suppose I would need to use merge.files, but how would a groups file be generated then :?: )
I am starting with 1.606.174 sequences, with lengths ranging from 290 till 548 NBases. Here I ran into my first question: how would I set the maxlength now? As most of my sequences vary between 403 and 427 bases, I chose to set the maxlength at 445 in the first screen.seqs, and of course the number of ambiguities was set to 1, which already reduces my total number of sequences to 1.030.909, which is removing approx. 1/3rd (35%) of the dataset. In the SOP only 15% of the data gets removed at this step. I was wondering if I chose my maxlength wrong? Does anybody have any experience how to choose this parameter or if this big loss is normal for MiSeq V3 data?
Next on the pcr.seqs I was very foggy on how to determine the exact position on which my sequences start and end in the silva database, hence I skipped this step because, as I understand, it’s main reason is to increase computational efficiency and computing power is not really an issue on my part.
Hence I aligned against the full silva.bacteria.fasta set and found out that most of the sequences could be found between positions 6428 and 23440, after which I again screened the sequences with the maxhomop=8 and start and ends set to the results of the alignment. This resulted again in only a 840.151 sequences remaining (18% reduction of previous data or 48% from the original set as opposed to an only 15% reduction from the original set at this point in the SOP).
For the pre.cluster command I used diffs=4 instead of 2 as the rule of thumb states 1 diff per 100 bases, and my sequences are 400 bases at least.
Furthermore, after chimera.uchime the number of total sequences is further reduced to 803074 which is quite acceptable (5% overall chimeras) but still this means that I really lose 50% of the data.
Finally, after removing unwanted taxa I end up with 802.104 sequences.
Overall, what worries me most is that I lose over 50% of my data and don’t really know how to “tweak” the parameters to mitigate this loss.
Any input on this matter would be warmly welcome
Thanks in advance,
FM Kerckhof