I may have something to contribute to this. I’ve actually been trying to replicate what Kunin et al did, and in my typical style I will just throw out a bunch of outputs:
mothur > summary.seqs(fasta=data.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 46 46 0 2
2.5%-tile: 1 55 55 0 3
25%-tile: 1 261 261 0 4
Median: 1 263 263 0 4
75%-tile: 1 269 269 0 4
97.5%-tile: 1 289 289 1 6
Maximum: 1 306 306 3 8
of Seqs: 4092
mothur > trim.seqs(fasta=data.fasta, qfile=data.qual, qthreshold=27, minlength=180, maxlength=300, oligos=data.oligos)
mothur > summary.seqs(fasta=data.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 180 180 0 3
2.5%-tile: 1 186 186 0 3
25%-tile: 1 197 197 0 4
Median: 1 216 216 0 4
75%-tile: 1 224 224 0 4
97.5%-tile: 1 243 243 0 4
Maximum: 1 248 248 0 6
of Seqs: 758
This nearly gave me a heart attack. WIth my remaining strength I grabbed LUCY http://softlayer.dl.sourceforge.net/project/lucy/lucy/lucy%201.20/lucy1.20.tar.gz and used that to do the end-trimming instead.
Here’s the command I used:
lucy -error 0.002 0.002 data.fna data.qual (the first 0.002 is equivalent to Mothur’s qaverage=27, and the second one is supposed to be the same as qthreshold=27. Also, LUCY doesn’t like random empty lines in the files).
And this was the output:
Phase 1: Checking the number of input sequences…
4092 input sequences found.
Phase 2: Reading sequence name, length and position data…
0 empty sequences discarded.
Phase 3: Reading quality values and checking good sequence regions…
4092 quality sequences read; 653 low quality sequences dropped.
Phase 4: 2nd sequences comparsion skipped…
no 2nd sequence file provided.
Phase 7: Producing output sequence and quality files…
3439 good sequences written in lucy.seq and lucy.qul.
I then proceeded to do this:
mothur > trim.seqs(fasta=lucy.seq, minlength=180, maxlength=300, maxambig=0, oligos=IVSCS.oligos)
mothur > summary.seqs(fasta=lucy.trim.fasta)
Start End NBases Ambigs Polymer
Minimum: 1 183 183 0 3
2.5%-tile: 1 221 221 0 3
25%-tile: 1 238 238 0 4
Median: 1 239 239 0 4
75%-tile: 1 243 243 0 4
97.5%-tile: 1 261 261 0 5
Maximum: 1 278 278 0 6
of Seqs: 2745
Notice that some sequences with N remained, so I had to trim them as well.
The big questions is–why do I get such different outputs from LUCY and Mothur? Note that the file used in the example was ran with FLX chemistry. I tried to do the same thing with some Titanium data I have (from multiple runs done at different facilities), and both Mothur and LUCY proceeded to discard >99% of the reads. I think someone has to repeat the analyses in the Kunin et al paper for Titanium chemistry…