Silva.seed_v119.align

Hello
I have another question about silva.seed_v119.align (I’m on mothur 1.33.3).
For the pcr.seqs step, I know my primers are at positions 23447 and 43116, so here is what I did:

pcr.seqs(fasta=silva.seed_v119.align, start=23447, end=43116, keepdots=F, processors=1)
summary.seqs(fasta=silva.seed_v119.pcr.align)
and I get this:
            Start    End    Nbases   Ambigs   Polymer  NumSeqs
Minimum     1        19356  661      0        4        1
2.5%-tile   511      19669  985      0        4        376
25%-tile    511      19669  702      0        5        3753
Median      511      19669  704      0        5        7505
75%-tile    511      19669  709      0        6        11257
97.5%-tile  511      19669  760      2        7        14634
Maximum     1985     19669  1450     5        10       15009
Mean        510.997  19669  712.157  0.17243  5.36984

# of seqs: 15009

My questions are:

  1. Why do most sequences start at 511 and only a minority start at 1? Leftover dots could explain it, but there shouldn’t be any (keepdots=F).

  2. Should I just trim everything that starts before 511 from my new DB by running pcr.seqs(fasta=silva.seed_v119.pcr.align, start=511, end=43116, keepdots=F, processors=1) and trust the resulting DB, or is there something wrong with the DB?

In order to understand what was going on I also did this (using the original silva.seed_v119.align downloaded from the SOP):

pcr.seqs(fasta=silva.seed_v119.align, start=20000, end=43116, keepdots=F, processors=1)

and I got this:
            Start     End    Nbases  Ambigs   Polymer  NumSeqs
Minimum     45        20803  729     0        4        1
2.5%-tile   45        21116  753     0        4        376
25%-tile    46        21116  770     0        5        3753
Median      46        21116  772     0        5        7505
75%-tile    46        21116  777     0        6        11257
97.5%-tile  46        21116  828     0        7        14634
Maximum     47        21116  1579    2        10       15009
Mean        46.00001  21116  780.25  0.18842  5.37278

# of seqs: 15009

Question:
Why do I get different end positions depending on the specified start position, and why does it start at 45 (and not 1) in this last command? Is this a normal inconsistency, am I missing something in the pcr.seqs process, or is there something weird with the DB?

Thanks a lot for any answer!

Best

The keepdots=F parameter removes the parts of the sequences before position 20000 and after position 43116. There may still be dots in a sequence if it has no bases at positions 20000 and 43116; those dots are kept to preserve the alignment. The summary.seqs command reports the start and end as the first and last positions where a base appears. In other words, even though the pcr.seqs command trimmed your sequences at position 20000, the bases didn’t start until about position 20045. Make sense?
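
To make the renumbering concrete, here is a small Python sketch on a toy aligned sequence. It is not mothur's code: the helper names (trim_columns, base_span, to_original) are made up for illustration, and it assumes 1-based, inclusive column positions, so the exact off-by-one convention mothur uses may differ. It shows that the reported start is the first column holding a base inside the trimmed window, and that the reported end shifts when you change the trim start, even though the last base sits in the same column of the original alignment.

```python
GAP_CHARS = {".", "-"}  # alignment padding and gap characters


def trim_columns(aligned, start, end):
    """Keep alignment columns start..end (assumed 1-based, inclusive)."""
    return aligned[start - 1:end]


def base_span(aligned):
    """Return (start, end): 1-based columns of the first and last base, or None."""
    cols = [i for i, ch in enumerate(aligned, start=1) if ch not in GAP_CHARS]
    if not cols:
        return None
    return cols[0], cols[-1]


def to_original(pos, trim_start):
    """Map a column in the trimmed alignment back to the original alignment."""
    return pos + trim_start - 1


# Toy 30-column sequence: first base in column 8, last base in column 24.
toy = "." * 7 + "ACG--TTAGGC--ATCA" + "." * 6

print(base_span(toy))                        # (8, 24)
print(base_span(trim_columns(toy, 5, 28)))   # (4, 20)
print(base_span(trim_columns(toy, 11, 28)))  # (3, 14)
```

Both trimmed results map back to the same last base: to_original(20, 5) and to_original(14, 11) are both 24. In the second trim the window opens on gap columns, so the start is 3 rather than 1, which is the same effect as seeing 45 or 511 in the summaries above.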