I am updating my benchmark tests of paired read assemblers, and I found that Make.contigs() appears to have a 100% false positive rate for assembling random sequences. This suggests that it may have a high false positive rate in practice for pairs that do not have any true overlap (quite common e.g. with ITS).
I used mothur v.1.36.1.
The test works like this: I generate sets of forward & reverse reads with unrelated random sequences, then I check whether the assembler makes contigs. None of the other assemblers I’ve tested make any contigs, except for PANDAseq which assembles ~70%.
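For reference, the generation step can be sketched roughly like this (filenames, read count, and read length here are illustrative, not necessarily the exact values in the tarball):

```python
import random

def random_read(length, rng):
    """A random nucleotide sequence, unrelated to any other read."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def write_fastq(path, reads, qchar="I"):
    """Write reads in FASTQ; 'I' is Q40 in Phred+33 encoding."""
    with open(path, "w") as f:
        for i, seq in enumerate(reads):
            f.write(f"@read{i}\n{seq}\n+\n{qchar * len(seq)}\n")

rng = random.Random(1)
fwd = [random_read(250, rng) for _ in range(100)]
rev = [random_read(250, rng) for _ in range(100)]
write_fastq("fwd.fastq", fwd)
write_fastq("rev.fastq", rev)
```

Since the forward and reverse sequences are independent random strings, any contig the assembler reports is by construction a false positive (apart from the vanishingly rare chance overlap).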
A typical test set, with the runme.bash script & output files, is in this tarball:
In this set, all the Q scores are set to 40, so any alignment mismatches should be taken seriously.
In the Make.contigs() wiki page, it says that the assembled qual scores cannot be used for quality filtering. I agree that the PANDAseq method gives incorrect posterior Q scores, but then I’m curious why you report them – is there anything they can be used for?
Thanks for posting your data and test set. Admittedly, the use case for make.contigs is where the reads overlap, and it has not been tested for ITS. I would probably argue that any analysis based on reads that don’t overlap is garbage, and as a reviewer I would reject any analysis based on such an approach. We know that real reads do not have perfect quality scores all the way across, and in fact the second read really sucks compared to the first. The thought of using reads without any overlap is pretty horrifying to me. People with ITS data probably really want to be careful picking sub regions within the ITS region, or should use Sanger or PacBio.
Regardless, when I run your data through summary.seqs we see that in your test case 19 of the 100 resulting contigs have an N, and those sequences would be discarded in the next step with screen.seqs. Next, the remaining 81 contigs without an N only overlap by 1-7 nt (see fwd.contigs.report). Again, in the screen.seqs step people would be encouraged to reject those sequences since they barely overlap with each other. Furthermore, at least for 16S subregions, we have a good idea of how long the sequence data should be. For example, the V4 region is rarely if ever longer than 260 nt, so we again encourage people to chuck contigs longer than the expected length. I would say that the output of make.contigs taken in context (it was never really designed to be used on its own) would result in no false positives. Perhaps you’d be more satisfied if we put the screen.seqs parameters into make.contigs, but those options already exist in screen.seqs so we didn’t feel like being redundant.
As for the quality scores, you can probably search through this forum and my mailbox and find countless people begging for us to include quality scores and me saying they’re meaningless on a contig with partial overlap. Even the pandaseq people seem to think they’re pretty meaningless. At some point we give people what they want even if we think it’s garbage. There’s a lot of stuff like that in mothur (see the plethora of chimera detection tools), but again, in context and with our recommendations, the commands make sense.
Hope this makes sense.
Thanks for the detailed comments.
I agree that if people follow your SOPs, this may be a non-issue & I added this point to the benchmark page (http://drive5.com/usearch/manual/merge_bench.html). Though I suspect people often pick out commands from our tools and use them in ways we would not recommend.
Re. ITS, on review it looks like my memory was wrong – in fact, the length variation of ITS1 & ITS2 is not so bad that you would get a mix of overlapping and non-overlapping pairs with well-designed primers. So your referees are right and I retract my comment.
I disagree that Q scores are meaningless with partial overlap. Q scores represent error probabilities. If you have two independent observations of the same base, then the error probability can be adjusted appropriately. The math for this is in Edgar & Flyvbjerg 2015 (Pubmed ID 26139637). None of the existing paired read assemblers I know of get this right except for SHERA (hard to find & no longer maintained) and USEARCH. Intuitively, it is clear that if you have a match the Q score should increase to reflect that you have more confidence that the base call is correct, and with a mismatch the Q score should decrease. PANDAseq gets this badly wrong – it usually reduces Q scores when there is a match, so good contigs have lower quality than the original read pair.
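To make the match case concrete, here is a small sketch of the posterior calculation for two agreeing base calls, following my reading of the formula in Edgar & Flyvbjerg 2015 (treat the exact expression as the paper’s, not mothur’s or PANDAseq’s):

```python
import math

def p_err(q):
    """Phred quality score -> error probability."""
    return 10 ** (-q / 10)

def q_score(p):
    """Error probability -> Phred score."""
    return -10 * math.log10(p)

def posterior_match(q1, q2):
    """Posterior error probability when two overlapping base calls agree
    (formula as I read it in Edgar & Flyvbjerg 2015)."""
    p1, p2 = p_err(q1), p_err(q2)
    return (p1 * p2 / 3) / (1 - p1 - p2 + 4 * p1 * p2 / 3)

# Two agreeing Q40 calls: confidence should go UP, not down.
p = posterior_match(40, 40)
print(q_score(p))  # well above Q40 (implementations typically cap the output)
```

The point is simply the direction of the adjustment: agreement between independent observations must raise the posterior Q above either input Q, which is exactly what PANDAseq fails to do.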
You can use (correctly) updated Q scores to quality-filter the contig, and I would argue that this is the best way to exploit the improved error estimates enabled by overlapping paired reads. The best way to filter is to calculate the expected number of errors / most probable number of errors from the Q scores and set a maximum. Of course this assumes the Q scores in the original reads are accurate, but empirically Illumina does a good job. For example, you’re right that the R2s are usually worse, but the basecaller software detects this and gives lower Q scores that correlate almost as well with error rates as the R1s. Pubmed 26139637 shows empirically that this filtering strategy dramatically reduces the tail of “harmful” contigs with >3% base call errors that would induce spurious OTUs.
Re. filtering by Ns, I was surprised only 19 out of the 100 FP contigs had any Ns. That’s why I sent a test case where all the bases are Q40. I didn’t find a description of the rule you use, but if you have a mismatch where both bases are Q40 then the probability that either one of the base calls is correct is ~1/2, and that would seem to be a strong case for an N (or, perhaps better, setting the posterior Q=3, which is an error probability of ~1/2). If you replace the reverse Q scores with Q=2 (the FASTQ character ‘#’) then you get no Ns, so the N filtering will not always work.
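The mismatch case can be checked numerically the same way, again using my reading of the Edgar & Flyvbjerg 2015 formula for the kept (higher-Q) base:

```python
import math

def posterior_mismatch(q1, q2):
    """Posterior error probability of the kept (higher-Q) base when two
    overlapping calls disagree (formula as I read Edgar & Flyvbjerg 2015)."""
    p1, p2 = sorted((10 ** (-q1 / 10), 10 ** (-q2 / 10)))  # p1 = higher-Q base
    return p1 * (1 - p2 / 3) / (p1 + p2 - 4 * p1 * p2 / 3)

# Two disagreeing Q40 calls: by symmetry either call is right with prob ~1/2,
# so the posterior should collapse to about Q3, not stay anywhere near Q40.
p = posterior_mismatch(40, 40)
print(round(p, 3))                   # ~0.5
print(round(-10 * math.log10(p)))    # ~3
```

That ~Q3 posterior is the quantitative version of the point above: a Q40/Q40 mismatch leaves you with essentially a coin flip, regardless of whether the assembler chooses to emit an N.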