Bug in trim.seqs

I’m getting some odd behavior when using trim.seqs with the qfile arguement when it’s not actually needed. Again I think an example will be more clear:

First I trim the sequences using average Q score as a threshold. Mothur didn’t throw up any mistakes so I’m assuming that qual file looked okay.

mothur > trim.seqs(fasta=Data.fasta, qfile=Data.qual, qaverage=25)
mothur > system(mv Data.trim.fasta Data.Q25.trim.fasta)
mothur > summary.seqs(fasta=Data.Q25.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 30 30 0 1
2.5%-tile: 1 34 34 0 2
25%-tile: 1 94 94 0 3
Median: 1 97 97 0 3
75%-tile: 1 97 97 0 3
97.5%-tile: 1 97 97 1 4
Maximum: 1 447 447 14 31

of Seqs: 151527

I then try to trim the sequences using some additional criteria which don’t actually require the qual file, but I left the qfile argument in anyway.

mothur > trim.seqs(fasta=Data.Q25.trim.fasta, qfile=Data.qual, maxambig=0, minlength=88, maxlength=107)
mothur > system(mv Data.Q25.trim.trim.fasta Data.Q25.QC.trim.fasta)
mothur > summary.seqs(fasta=Data.Q25.QC.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 88 88 0 2
2.5%-tile: 1 91 91 0 2
25%-tile: 1 96 96 0 3
Median: 1 97 97 0 3
75%-tile: 1 97 97 0 3
97.5%-tile: 1 97 97 0 3
Maximum: 1 107 107 0 7

of Seqs: 117266

Now I try the same thing but take the qfile argument out.

mothur > trim.seqs(fasta=Data.Q25.trim.fasta, maxambig=0, minlength=88, maxlength=107)
mothur > system(mv Data.Q25.trim.trim.fasta Data.Q25.QC.trim.fasta)
mothur > summary.seqs(fasta=Data.Q25.QC.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 88 88 0 2
2.5%-tile: 1 91 91 0 2
25%-tile: 1 96 96 0 3
Median: 1 97 97 0 3
75%-tile: 1 97 97 0 3
97.5%-tile: 1 97 97 0 3
Maximum: 1 107 107 0 7

of Seqs: 120644

I randomly picked a few sequences in the fasta and qual files to check that their length actually match, and as far as I can see, they did. Is anyone else able to replicate this behavior?

If you open the qual file in a text editor you’ll see that you get as many qual scores per sequence as the same, untrimmed sequence is long. So when you give the trimmed file and the qual file, weird things are bound to happen.

So does that mean any trimming related to quality (i.e., qaverage and qtrim) need to be done before other types of trimming. I’m not sure I understood you 100%. Thanks.

you can do everything together:

mothur > trim.seqs(fasta=Data.fasta, qfile=Data.qual, qaverage=25, maxambig=0, minlength=88, maxlength=107)

should yield…

Start End NBases Ambigs Polymer
Minimum: 1 88 88 0 2
2.5%-tile: 1 91 91 0 2
25%-tile: 1 96 96 0 3
Median: 1 97 97 0 3
75%-tile: 1 97 97 0 3
97.5%-tile: 1 97 97 0 3
Maximum: 1 107 107 0 7

of Seqs: 120644

Oh I know that. I was trying to see how many sequences each of the criteria trimmed, so I did it in multiple steps. Since the current behavior deviates from what people may expect, perhaps it would be a good idea to document it?

yes indeed, these are pyrotags, at the 3’ end only. I don’t know how I got some forward and some backward. Perhaps that happened before I saw the data–someone else did the original QC.