Hi, I’m new to Mothur and I’m therefore using the MiSeqSOP with the tutorial’s dataset to get familiar with the pipeline before playing with my own data. I think I might have spotted a bug on Mothur v.1.33.3 (Windows 32-bit version). Well, at least there’s something that bugs me…In order to follow the SOP I downloaded the silva.seed_v119.align database on the 3rd of November 2014 (2 days ago). The problem appears when I run the following command:
pcr.seqs(fasta=silva.seed_v119.align, start=11894, end=25319, keepdots=F, processors=1)
system(rename=silva.seed_v119.pcr.align silva.v4.fasta)
summary.seqs(fasta=silva.v4.fasta)
Instead of getting this, as the SOP states:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13424 270 0 3 1
2.5%-tile: 1 13425 292 0 4 374
25%-tile: 1 13425 293 0 4 3740
Median: 1 13425 293 0 4 7479
75%-tile: 1 13425 293 0 5 11218
97.5%-tile: 1 13425 294 1 6 14583
Maximum: 3 13425 351 5 9 14956
Mean: 1.00074 13425 292.977 0.0573014 4.57014
of Seqs: 14956
I get this:
Start End Nbases Ambigs Polymer NumSeqs
Minimum: 1 1150 260 0 3 1
2.5%-tile: 1 13425 292 0 4 376
25%-tile: 1 13425 293 0 4 3753
Median: 1 13425 293 0 5 7505
75%-tile: 1 13425 293 0 5 11257
97.5%-tile: 1 13425 460 1 6 14634
Maximum: 3 13425 1122 5 9 15009
Mean: 1.0006 13424.9 321.419 0.064428 1.78433
of Seqs: 15009
I understand that perhaps the database got 53 sequences bigger between the time you wrote the SOP and the day I downloaded it. So I got rid of the sequences (in theory the 53 additional sequences because 15009-14956=43) that were smaller than 270 and longer than 351 in order to match the SOP output. To do so I ran:
trim.seqs(fasta=silva.v4.fasta, minlength=270, maxlength=351, processors=1)
summary.seqs(fasta=silva.v4.trim.fasta)
which gives me this:
Start End Nbases Ambigs Polymer NumSeqs
Minimum: 1 1150 270 0 3 1
2.5%-tile: 1 13425 292 0 4 312
25%-tile: 1 13425 293 0 4 3114
Median: 1 13425 293 0 4 6227
75%-tile: 1 13425 293 0 5 9340
97.5%-tile: 1 13425 294 1 6 12141
Maximum: 1 13425 351 5 9 12452
Mean: 1 293.056 293.056 0.061115 4.5983
of Seqs: 12452
So it turns out if I get rid of the sequences that in theory make up for the difference between my results and the ones from the SOP I end up getting rid of 2557 sequences (15009 originally, but only 12452 left after trim.seqs), instead of the theoretical 53 (15009 sequence in my silva.seed_v119.align but only 14956 in yours).
Questions are:
- How can that be?
- Should I be worried about using this DB, is this a real bug or something I can ignore?
Many thanks in advance for your help!