How does trim.seqs(logtransform=T) work?

Hello and thanks in advance for any insight/advice offered.

I’m trying to understand the intended use of the logtransform option in trim.seqs().

I expected it to transform all my Q scores (including my designated qwindowaverage threshold) with one of the following equations:

  1. New_Q=log10(Q)
  2. New_Q=10^(-Q/10)

Either way, I expected this option to give low Q scores more weight in the window average, and thus to increase the stringency of any given qwindowaverage threshold.
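To make my two guesses concrete, here's a quick Python sketch of what I imagined the transform doing (my own illustration for discussion, not mothur's actual code):

```python
import math

def transform_log10(q):
    # Guess #1: New_Q = log10(Q)
    return math.log10(q)

def transform_error_prob(q):
    # Guess #2: convert Q to its Phred error probability, New_Q = 10^(-Q/10)
    return 10 ** (-q / 10)

# e.g., for a base with Q=20:
print(transform_log10(20))       # -> ~1.301
print(transform_error_prob(20))  # -> 0.01 (i.e., a 1% error rate)
```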

However, when I tested it on several datasets, logtransform=T retained more reads, and longer reads, with lower apparent quality (note: I'm assessing overall quality with boxplots from FastX).

Here’s an example from my Mothur logs:
With logtransform=T:

mothur v.1.35.0
Last updated: 03/23/2014
...

mothur > trim.seqs(fasta=file.fasta, oligos=file.oligos, qfile=file.qual, maxambig=0, maxhomop=8, flip=F, bdiffs=0, pdiffs=2, minlength=100, qwindowaverage=35, qwindowsize=50, logtransform=T, processors=8)
...
Total of all groups is 3569123
...
  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 100 100 0 2 1
2.5%-tile: 1 168 168 0 4 89229
25%-tile: 1 270 270 0 5 892281
Median:  1 271 271 0 5 1784562
75%-tile: 1 271 271 0 6 2676843
97.5%-tile: 1 271 271 0 8 3479895
Maximum: 1 273 273 0 8 3569123
Mean: 1 261.53 261.53 0 5.44595
# of Seqs: 3569123

With logtransform not set (default=F):

mothur v.1.35.0
Last updated: 03/23/2014
...
mothur > trim.seqs(fasta=file.fasta, oligos=file.oligos, qfile=file.qual, maxambig=0, maxhomop=8, flip=F, bdiffs=0, pdiffs=2, minlength=100, qwindowaverage=35, qwindowsize=50, processors=8)
...
Total of all groups is 2340658
...
  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 100 100 0 2 1
2.5%-tile: 1 106 106 0 3 58517
25%-tile: 1 155 155 0 4 585165
Median:  1 206 206 0 5 1170330
75%-tile: 1 242 242 0 6 1755494
97.5%-tile: 1 271 271 0 8 2282142
Maximum: 1 273 273 0 8 2340658
Mean: 1 197.722 197.722 0 5.07486
# of Seqs: 2340658

Am I misunderstanding the intended function of this option? Is this a bug?

I've run a total of 4 different datasets with and without logtransform=T, and all show the same pattern: with it, more reads are retained and the trimmed reads are longer. Without it, read lengths rise steadily across the quantiles (as in the summary above) rather than piling up at the maximum length.

If it matters, all 4 datasets are unpaired MiSeq reads (I can't make contigs because my forward and reverse reads don't overlap).

Also: what I'm really looking for is a way to weight Q scores by their actual error rate (i.e., equation #2 above). I've realized that qwindowaverage can be stringent and yet still ineffective at removing reads with very low Q scores. For example, the average of q=30 and q=10 is q=20, and q=20 corresponds to a 1% error rate per base; but the actual mean error rate of a q=30 base and a q=10 base is (0.001 + 0.1)/2 ≈ 0.05, i.e., ~5% per base.
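Here's a small Python demonstration of that discrepancy: averaging raw Q scores versus averaging the underlying error probabilities and converting back to a Q score (again, just my own illustration of the arithmetic, not anything from mothur):

```python
import math

def mean_q(qs):
    # Naive window average: arithmetic mean of the Q scores themselves
    return sum(qs) / len(qs)

def mean_error_q(qs):
    # Average the per-base Phred error probabilities, then convert back to Q
    p = sum(10 ** (-q / 10) for q in qs) / len(qs)
    return -10 * math.log10(p)

window = [30, 10]
print(mean_q(window))        # -> 20.0, which implies a 1% error rate
print(mean_error_q(window))  # -> ~12.97, reflecting the true ~5% mean error
```

A window passing a naive qwindowaverage=20 cutoff can therefore hide bases whose true mean error rate is five times higher than Q=20 suggests, which is exactly the behavior I'd like a transform option to correct.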