make.contigs tweaks

Yet again, someone in our building has come to me with those god damn reads generated using the 27F/519R primers with v3 chemistry (2 x 300 bp). Yes, I’ve now passed on messages to our sequencing centre, and to the school as well, about not going down this road. ANYWAY, is there a way to tweak make.contigs to better handle this situation? The guys in the sequencing centre said:

“There are some reagent performance issues with the v3 600-cycle kits from Illumina which manifest as poor quality towards the end of each read of a long read run (2x300bp), but particularly affect read 2. The last update I had from them indicated they may have found a cause, but are still investigating. They have not put any usage or purchase holds on these reagents and have not issued any technical bulletins or warnings about them either. One reason for this is that there is variability in the performance of the reagents even from the same lot, and runs can still pass their performance parameters, although some more stringent mitigations are now necessary whereas previously these were not much of a concern. These include lowering the cluster density and increasing the PhiX spike-in.”

and passed on some info about what another group was doing:

“The quality of all Illumina R1 and R2 reads was assessed visually using fastqc [1]. Generally we observed a significant drop in read quality in the last 50-100bp of R2 and the last 10bp of R1. We trim the 5’ end of R1 by 10bp and the 5’ end of R2 by 70bp (we chose to trim as many bp as possible while still leaving an overlap that allowed reliable merging of R1 and R2 reads). Reads were then merged using FLASH [2]. After merging, several hundred sequences were merged manually and the results compared to the FLASH merges to ensure efficacy of FLASH.”
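
For anyone wanting to try the same thing, here is a minimal Python sketch of the trimming step. It assumes the intent is to remove the low-quality tail at the end of each read (the quote says 5’ end, but the quality drop it describes is in the last bases of each read), and it assumes plain four-line FASTQ records. The function names `parse_fastq` and `trim_tail` are made up for illustration and are not part of mothur or FLASH.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality) for one read
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from plain four-line FASTQ text (blank lines are skipped)."""
    stripped = (line.rstrip("\r\n") for line in lines if line.strip())
    # zip over the same iterator four times groups lines into records
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, qual


def trim_tail(record: FastqRecord, n: int) -> FastqRecord:
    """Drop the last n bases and their quality scores from one read."""
    header, seq, qual = record
    if n <= 0:
        return record
    return header, seq[:-n], qual[:-n]


def format_fastq(record: FastqRecord) -> str:
    """Render one record back to four-line FASTQ text."""
    header, seq, qual = record
    return f"{header}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    demo = ["@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIHHH##\n"]
    for rec in parse_fastq(demo):
        # Hypothetical trim of 3 bases; the quoted group used 10 (R1) and 70 (R2)
        print(format_fastq(trim_tail(rec, 3)), end="")
```

The trimmed R1 and R2 files would then go into the merging step in place of the raw fastqs. Both files need to keep the same reads in the same order, which this sketch does because it never drops a record.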

I’m naturally curious and will have a look at this approach. I’m most interested in whether trimming the sequences before attempting to make contigs will help avoid losing ~60 % of your sequences after screening the contigs!!! :shock: (generally because one or more Ns are present). Also keep in mind that of the 40 % that pass, only 30 % (30 % of the 40 %) are unique :lol: Garbage in, garbage out…

Sorry! We haven’t played much more with this. My thinking is that if you trim the reads then you have even less overlap between them. I kind of think you’re being sent on a fool’s errand. It might save money, in the long run, to resequence with V2 chemistry on the V4 region - of course, that assumes they value your time :expressionless:
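
To put rough numbers on the overlap concern, here is a small Python sketch. The amplicon length of 470 bp is an assumption (it is close to the median contig length in the summaries further down the thread), the trim lengths of 10 and 70 bp come from the quoted protocol above, and `expected_overlap` is a made-up helper, not a mothur function.

```python
def expected_overlap(r1_len: int, r2_len: int, amplicon_len: int) -> int:
    """Bases covered by both reads when they are sequenced from opposite ends."""
    return max(0, r1_len + r2_len - amplicon_len)


AMPLICON_LEN = 470  # assumed; roughly the median contig length in this thread
READ_LEN = 300      # 2 x 300 bp chemistry

untrimmed = expected_overlap(READ_LEN, READ_LEN, AMPLICON_LEN)
trimmed = expected_overlap(READ_LEN - 10, READ_LEN - 70, AMPLICON_LEN)

print(f"overlap without trimming: {untrimmed} bp")  # 130 bp
print(f"overlap after trimming:   {trimmed} bp")    # 50 bp
```

Under those assumptions the overlap drops from about 130 bp to about 50 bp, and longer amplicons in the same sample would be left with even less.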


There were some slight differences in the output depending on whether I ran make.contigs on the raw fastqs or first chopped off the ends (10 bp from R1 and 75 bp from R2) before make.contigs. The lengths of the raw contigs were different (not shown). For both sets I then;


make.contigs default

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      392      392      0       4        1
2.5%-tile:  1      432      432      0       4        348
25%-tile:   1      454      454      0       5        3471
Median:     1      471      471      0       5        6941
75%-tile:   1      486      486      0       6        10411
97.5%-tile: 1      512      512      0       6        13534
Maximum:    1      548      548      0       12       13881
Mean:       1      471.153  471.153  0       5.14646
# of Seqs:  13881

13881 total, 5452 unique (39.3 %)

Chop before make.contigs

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      402      402      0       3        1
2.5%-tile:  1      434      434      0       4        482
25%-tile:   1      461      461      0       5        4813
Median:     1      483      483      0       5        9626
75%-tile:   1      487      487      0       6        14438
97.5%-tile: 1      510      510      0       11       18769
Maximum:    1      515      515      0       12       19250
Mean:       1      473.839  473.839  0       5.38348
# of Seqs:  19250

19250 total, 9822 unique (51.0 %)

Note that the initial number of sequences was 72430, so you lose roughly 73 to 81 % of your sequences either way (about 81 % with the defaults, about 73 % after chopping). :shock: Dodgy.
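
The percentages can be checked with a few lines of Python. The counts are the ones posted above. Reading the second number in each pair as the count of unique sequences is my assumption, and `summarise` is just an illustrative helper.

```python
INITIAL_READS = 72430

# label -> (sequences remaining, second count from the post, assumed to be unique sequences)
RUNS = {
    "make.contigs default": (13881, 5452),
    "chop before make.contigs": (19250, 9822),
}


def percent(part: int, whole: int) -> float:
    """Percentage of whole represented by part, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summarise(initial: int, kept: int, unique: int) -> dict:
    """Retention, loss, and unique fraction for one processing route."""
    return {
        "kept_pct": percent(kept, initial),
        "lost_pct": percent(initial - kept, initial),
        "unique_pct": percent(unique, kept),
    }


for label, (kept, unique) in RUNS.items():
    s = summarise(INITIAL_READS, kept, unique)
    print(f"{label}: kept {s['kept_pct']} %, lost {s['lost_pct']} %, "
          f"unique {s['unique_pct']} % of kept")
```

This gives 19.2 % kept (80.8 % lost) for the default route and 26.6 % kept (73.4 % lost) for the chopped route, so chopping first recovers about 5400 extra sequences but most of the run is still lost.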