mothur

Additional quality filtering in make.contigs, truncate option in screen.seqs

Well I had this whole lovely post written and had to re-sign in and it was deleted… Anyway, in short:

I would like to see additional quality checking in/before make.contigs. If you have fully overlapping reads and are using make.contigs to error correct, wouldn’t it make sense to filter out low-quality reads before making contigs with them? Not the insert option, but something more similar to the qaverage/qwindow options in trim.seqs, but with the ability to handle paired reads and chuck both forward and reverse reads if one doesn’t pass, so you’re not error correcting with overall low-quality sequences.

I would also like to see a truncation option in screen.seqs. The premise: it’s been suggested that as few as 75 bp could work for identifying an organism with the 16S gene (http://www.pnas.org/content/108/Supplement_1/4516). It would need to be tested further (and I personally wouldn’t go as short as 75 bp), but why use/process 200 bp of sequence if 175 bp could give you the identification? 25 bp × N million seqs = a lot of processing that could be saved.
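The arithmetic behind that last point is easy to make concrete; a back-of-the-envelope sketch (the read count here is a made-up example, not a figure from this thread):

```python
# Rough estimate of bases saved by truncating reads before processing.
# The run size is a hypothetical example for illustration only.
reads = 10_000_000          # hypothetical number of sequences in a run
saved_per_read = 200 - 175  # bp trimmed from each read
total_saved = reads * saved_per_read
print(f"{total_saved:,} bp saved")
```

Even a modest 25 bp trim adds up to hundreds of millions of bases over a typical run.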

-Ashley

> I would like to see additional quality checking in/before make.contigs. If you have fully overlapping reads and are using make.contigs to error correct, wouldn’t it make sense to filter out low-quality reads before making contigs with them? Not the insert option, but something more similar to the qaverage/qwindow options in trim.seqs, but with the ability to handle paired reads and chuck both forward and reverse reads if one doesn’t pass, so you’re not error correcting with overall low-quality sequences.

We’ve actually found that none of those options makes a difference. The key is that the reads must fully overlap with each other. You can see the details in our Kozich et al. paper from AEM. The fact is that both reads contain errors, and each is needed to denoise the other.

> I would also like to see a truncation option in screen.seqs. The premise: it’s been suggested that as few as 75 bp could work for identifying an organism with the 16S gene (http://www.pnas.org/content/108/Supplement_1/4516). It would need to be tested further (and I personally wouldn’t go as short as 75 bp), but why use/process 200 bp of sequence if 175 bp could give you the identification? 25 bp × N million seqs = a lot of processing that could be saved.

We do have a chop.seqs command that you could use for this.

I have read the paper, and I understand the value in having fully overlapping reads and using those to error correct base calls at any given position, and the algorithm makes sense. I don’t really see how this connects to quality filtering paired reads before error-correcting though. An example: if you have two completely overlapping reads that look like this once you reverse complement the reverse read:

Real sequence: ATCTCGAACAATA

Fwd: ATCTCGAAC[T]ATA
Rvs: ATCTCGAAC[A]ATA

The difference is bracketed. If the A in the reverse read has a quality score at least 6 points higher than the T’s, the sequence is corrected at that position to A and it is error corrected. If the scores differ by fewer than 6 points, it is not corrected but set to N. If the user goes by convention and does not allow any ambiguous bases, then this read pair is chucked, which would negate the need to filter before error correcting.
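The per-position rule described above can be sketched in a few lines. This is an illustrative reimplementation of the quality-difference idea, not mothur’s actual make.contigs code; the function name and the deltaq=6 default are assumptions for the example:

```python
def consensus_base(b_fwd, q_fwd, b_rvs, q_rvs, deltaq=6):
    """Pick a consensus base for one overlapping position.

    Mirrors the rule described above: matching bases are kept;
    a mismatch is resolved only when one base's quality score beats
    the other's by at least `deltaq`, otherwise the position becomes N.
    (Illustrative sketch, not mothur's actual implementation.)
    """
    if b_fwd == b_rvs:
        return b_fwd
    if q_fwd - q_rvs >= deltaq:
        return b_fwd
    if q_rvs - q_fwd >= deltaq:
        return b_rvs
    return "N"

# The mismatch from the example above: T (forward) vs A (reverse).
print(consensus_base("T", 20, "A", 30))  # A: reverse wins by >= 6 points
print(consensus_base("T", 28, "A", 30))  # N: scores too close to call
```

With the no-ambiguous-bases convention, that second case (the N) is what gets the whole pair discarded downstream.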

However, if for whatever reason (and it happens) you just had a really bad set of paired reads:

Fwd: ATCTCGATCAATC
Rvs: TTCACGAAGATTC

If, by chance, any of these positions falls below the 6-point threshold, you have your N and this terrible read pair is chucked. But if they all differ by more than 6 points, you’ve just generated a spurious sequence.
Now, likely these terrible reads would have lower overall quality scores for reasons outlined in the Kozich paper: (“To explore whether the quality scores could be used as a reliable surrogate for sequence quality, we categorized each quality score by whether its base call was the expected base (i.e., a match), a substitution, an insertion, or an ambiguous base (Fig. 2). As expected, errors were associated with low-quality scores and rarely had a quality score above 21.”)
Filtering out this pair based on quality score before error correcting would make this a non-issue and reduce spurious sequences.
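The pre-filter being proposed could look something like this; a minimal sketch of a paired analogue of trim.seqs’ qaverage idea, where the function name and the quality cutoff are made up for illustration:

```python
def keep_pair(quals_fwd, quals_rvs, min_avg=25):
    """Discard both reads if either mate's average quality is too low.

    A paired analogue of the qaverage idea from trim.seqs: if one
    mate fails the cutoff, the whole pair is chucked so overall
    low-quality reads never reach the error-correction step.
    The min_avg threshold is an illustrative assumption.
    """
    def avg(quals):
        return sum(quals) / len(quals)
    return avg(quals_fwd) >= min_avg and avg(quals_rvs) >= min_avg

good = keep_pair([35, 36, 38, 34], [33, 37, 36, 35])  # both mates fine
bad = keep_pair([35, 36, 38, 34], [12, 15, 14, 13])   # reverse mate fails
print(good, bad)  # True False
```

A pair like the terrible one above, with mostly low-quality scores on both reads, would be dropped here before it had the chance to produce a spurious contig.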

Have you ever played around with chop.seqs to see how it influences downstream steps?

Hi,

Yes, I’ve played around with myriad different quality control settings and none of them were better than the consensus approach we used in Kozich.

Pat

To my understanding, chop.seqs trims fasta files after merging the reads. Is there an option to chop fastq files before merging?

Thanks!
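For what it’s worth, chopping a fastq to a fixed length before merging is simple enough to do outside mothur; a minimal sketch assuming standard four-line FASTQ records (the function name and lengths are illustrative):

```python
def chop_fastq_records(lines, numbases):
    """Truncate each four-line FASTQ record to `numbases`.

    The sequence and quality strings are trimmed together so they
    stay the same length; records already shorter than `numbases`
    pass through unchanged. (Illustrative sketch; mothur's chop.seqs
    itself operates on fasta files.)
    """
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        out += [header, seq[:numbases], plus, qual[:numbases]]
    return out

record = ["@read1", "ATCTCGAACAATA", "+", "IIIIIIIIIHHHH"]
chopped = chop_fastq_records(record, 10)
print(chopped[1], chopped[3])  # ATCTCGAACA IIIIIIIIIH
```

The same loop works on the forward and reverse files independently, so both mates can be chopped before they are handed to make.contigs.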