pcr.seqs bug? in v.1.33.3

Hi, I’m new to Mothur and I’m therefore using the MiSeqSOP with the tutorial’s dataset to get familiar with the pipeline before playing with my own data. I think I might have spotted a bug on Mothur v.1.33.3 (Windows 32-bit version). Well, at least there’s something that bugs me…In order to follow the SOP I downloaded the silva.seed_v119.align database on the 3rd of November 2014 (2 days ago). The problem appears when I run the following command:

pcr.seqs(fasta=silva.seed_v119.align, start=11894, end=25319, keepdots=F, processors=1)
system(rename=silva.seed_v119.pcr.align silva.v4.fasta)
summary.seqs(fasta=silva.v4.fasta)

Instead of getting this, as the SOP states:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13424 270 0 3 1
2.5%-tile: 1 13425 292 0 4 374
25%-tile: 1 13425 293 0 4 3740
Median: 1 13425 293 0 4 7479
75%-tile: 1 13425 293 0 5 11218
97.5%-tile: 1 13425 294 1 6 14583
Maximum: 3 13425 351 5 9 14956
Mean: 1.00074 13425 292.977 0.0573014 4.57014

of Seqs: 14956

I get this:
Start End Nbases Ambigs Polymer NumSeqs
Minimum: 1 1150 260 0 3 1
2.5%-tile: 1 13425 292 0 4 376
25%-tile: 1 13425 293 0 4 3753
Median: 1 13425 293 0 5 7505
75%-tile: 1 13425 293 0 5 11257
97.5%-tile: 1 13425 460 1 6 14634
Maximum: 3 13425 1122 5 9 15009
Mean: 1.0006 13424.9 321.419 0.064428 1.78433

of Seqs: 15009

I understand that perhaps the database got 53 sequences bigger between the time you wrote the SOP and the day I downloaded it. So I got rid of the sequences (in theory the 53 additional sequences because 15009-14956=43) that were smaller than 270 and longer than 351 in order to match the SOP output. To do so I ran:

trim.seqs(fasta=silva.v4.fasta, minlength=270, maxlength=351, processors=1)
summary.seqs(fasta=silva.v4.trim.fasta)

which gives me this:
Start End Nbases Ambigs Polymer NumSeqs
Minimum: 1 1150 270 0 3 1
2.5%-tile: 1 13425 292 0 4 312
25%-tile: 1 13425 293 0 4 3114
Median: 1 13425 293 0 4 6227
75%-tile: 1 13425 293 0 5 9340
97.5%-tile: 1 13425 294 1 6 12141
Maximum: 1 13425 351 5 9 12452
Mean: 1 293.056 293.056 0.061115 4.5983

of Seqs: 12452

So it turns out if I get rid of the sequences that in theory make up for the difference between my results and the ones from the SOP I end up getting rid of 2557 sequences (15009 originally, but only 12452 left after trim.seqs), instead of the theoretical 53 (15009 sequence in my silva.seed_v119.align but only 14956 in yours).

Questions are:

  1. How can that be?
  2. Should I be worried about using this DB, is this a real bug or something I can ignore?

Many thanks in advance for your help!

Hi the seed in the SOP is the older version of the seed that only had 14956 sequences. The newer one has more sequences and was actually created in a slightly different manner. You should be ok.

Pat

OK, thanks!