End trimming pyrosequences

A recent paper by Kunin and all in EM suggests that end trimming sequences with a quality score of <27 drastically reduces the error rate in pyrosequence analysis. Could you add an end-trim feature to trim.seqs such that low quality bases at the end of the sequence are trimmed at a specific threshold? I think that the qthreshold option kills the whole sequence, but I would just like the end trimmed.


In version 1.6, if you set the qthreshold=27, mothur will trim the end of the sequence if it drops below 27 and put it in the .trim file. In version 1.7, releasing later this month, we have changed this so if a sequence falls below the qthreshold it gets scrapped, unless the qtrim parameter is set to true. If qtrim=true then it will behave like 1.6.

I may have something to contribute to this. I’ve actually been trying to replicate what Kunin et al did, and in my typical style I will just throw out a bunch of outputs:

mothur > summary.seqs(fasta=data.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 46 46 0 2
2.5%-tile: 1 55 55 0 3
25%-tile: 1 261 261 0 4
Median: 1 263 263 0 4
75%-tile: 1 269 269 0 4
97.5%-tile: 1 289 289 1 6
Maximum: 1 306 306 3 8

of Seqs: 4092

mothur > trim.seqs(fasta=data.fasta, qfile=data.qual, qthreshold=27, minlength=180, maxlength=300, oligos=data.oligos)
mothur > summary.seqs(fasta=data.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 180 180 0 3
2.5%-tile: 1 186 186 0 3
25%-tile: 1 197 197 0 4
Median: 1 216 216 0 4
75%-tile: 1 224 224 0 4
97.5%-tile: 1 243 243 0 4
Maximum: 1 248 248 0 6

of Seqs: 758

This nearly gave me a heart attack. WIth my remaining strength I grabbed LUCY http://softlayer.dl.sourceforge.net/project/lucy/lucy/lucy%201.20/lucy1.20.tar.gz and used that to do the end-trimming instead.

Here’s the command I used:

lucy -error 0.002 0.002 data.fna data.qual (the first 0.002 is equivalent to Mothur’s qaverage=27, and the second one is supposed to be the same as qthreshold=27. Also, LUCY doesn’t like random empty lines in the files).

And this was the output:

Phase 1: Checking the number of input sequences…
4092 input sequences found.
Phase 2: Reading sequence name, length and position data…
0 empty sequences discarded.
Phase 3: Reading quality values and checking good sequence regions…
4092 quality sequences read; 653 low quality sequences dropped.
Phase 4: 2nd sequences comparsion skipped…
no 2nd sequence file provided.
Phase 7: Producing output sequence and quality files…
3439 good sequences written in lucy.seq and lucy.qul.

I then proceeded to do this:

mothur > trim.seqs(fasta=lucy.seq, minlength=180, maxlength=300, maxambig=0, oligos=IVSCS.oligos)
mothur > summary.seqs(fasta=lucy.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 183 183 0 3
2.5%-tile: 1 221 221 0 3
25%-tile: 1 238 238 0 4
Median: 1 239 239 0 4
75%-tile: 1 243 243 0 4
97.5%-tile: 1 261 261 0 5
Maximum: 1 278 278 0 6

of Seqs: 2745

Notice that some sequences with N remained, so I had to trim them as well.

The big questions is–why do I get such different outputs from LUCY and Mothur? Note that the file used in the example was ran with FLX chemistry. I tried to do the same thing with some Titanium data I have (from multiple runs done at different facilities), and both Mothur and LUCY proceeded to discard >99% of the reads. I think someone has to repeat the analyses in the Kunin et al paper for Titanium chemistry…

hmm… could you email me the fasta, qual, and oligo file? one possibility is that a bad base at the beginning of the sequence would scrap the entire sequence in mothur…

Yeah that is possible since LUCY’s end-trim function didn’t get rid of all the unresolved bases in the data file. I was quite surprised when Kunin et al managed to recovered >70% of the sequences using a min Q27 trim since my previous experience told me that Q20 is very common in almost all reads. It would be really messed up if their whole pipeline depended on a bug in LUCY…

you might follow up with phil hugenholtz…

Will try to, but I leave for a field trip in two days, so this will probably have to wait until after Christmas.

Regardless of how it actually works, the Kunin paper established their guidelines based on empirical evidence, so they are still applicable in the real world. It would be interesting to find out how LUCY’s end trim function works though.