screen.seqs issue with start and end parameters

Hi,
I would like to know how the “start” and “end” options in screen.seqs work. I have about 4.5M sequences and, because of my data structure, I have to split the data set (alignment) after aligning my sequences to my customised reference alignment. With gaps, this combined alignment is about 14k positions long, so I split it at around 7k for a first glimpse of how many sequences would be in each part.
The first part has 400k of 4.5M and the latter part has 1.8M of 4.5M, which makes 2.2M combined in the split alignments.
I have two basic questions:
1.) Where do the 2M sequences disappear to?
2.) Looking at the summary files, there are still plenty of sequences that end later than the indicated “end” in part one (and sequences that start earlier than the indicated “start” in part two). How is that possible?

screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1, end=7279, processors=16)

screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=7280, end=14594, processors=16)


Here are the result summaries for the two parts:

Part 1

              Start   End      NBases   Ambigs  Polymer  NumSeqs
Minimum:      1       7413     208      0       3        1
2.5%-tile:    1       7435     292      0       5        11935
25%-tile:     1       8236     379      0       6        119349
Median:       1       11458    441      0       6        238697
75%-tile:     1       12304    468      0       6        358045
97.5%-tile:   1       12531    479      0       8        465459
Maximum:      1       14594    500      0       250      477393
Mean:         1       10459.5  418.7    0       6.05531

# of unique seqs: 450707

total # of seqs: 477393

Part 2

              Start   End      NBases   Ambigs  Polymer  NumSeqs
Minimum:      1       14594    275      0       3        1
2.5%-tile:    2107    14594    300      0       5        49390
25%-tile:     2449    14594    409      0       6        493895
Median:       2561    14594    459      0       6        987789
75%-tile:     4104    14594    473      0       6        1481683
97.5%-tile:   6306    14594    482      0       8        1926187
Maximum:      6937    14594    500      0       210      1975576
Mean:         3455.8  14594    432.702  0       6.16803

# of unique seqs: 1806487

total # of seqs: 1975576

What does screen.seqs do? Am I using the wrong function for what I want to do?
Kind regards,
Karin

1.) Where do the 2M sequences disappear to?

The sequences not in the “good” file are scrapped. A list of them can be found in the *bad.accnos file created by the command.

2.) Looking at the summary files, there are still plenty of sequences that end later than the indicated “end” in part one (and sequences that start earlier than the indicated “start” in part two). How is that possible?

I don’t think you are using the start and end parameters as they are intended. The start parameter tells mothur the position that all sequences must start by; sequences that start after it are scrapped. The end parameter tells mothur the position that all sequences must end after; sequences that end before it are scrapped.
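
To make the rule concrete, here is a minimal Python sketch of the filter as described above. It is not mothur's own code: the function name passes_screen is hypothetical, and treating both boundaries as inclusive is an assumption.

```python
from typing import Optional


def passes_screen(seq_start: int, seq_end: int,
                  start: Optional[int] = None,
                  end: Optional[int] = None) -> bool:
    """Return True if an aligned sequence would be kept by the screen.

    Hypothetical helper, not part of mothur. Positions are alignment
    columns; boundaries are assumed to be inclusive.
    """
    if start is not None and seq_start > start:
        return False  # sequence begins after the required start column
    if end is not None and seq_end < end:
        return False  # sequence stops before the required end column
    return True


# Part 1 call: start=1, end=7279
print(passes_screen(1, 7413, start=1, end=7279))   # covers columns 1..7279
print(passes_screen(1, 7000, start=1, end=7279))   # ends before column 7279

# Part 2 call: start=7280, end=14594
print(passes_screen(2561, 14594, start=7280, end=14594))  # starts by 7280, reaches 14594
print(passes_screen(7300, 14594, start=7280, end=14594))  # starts after column 7280
```

Read this way, a sequence is kept only if it spans the whole start-to-end window. That matches the summaries: every End in Part 1 is at or beyond 7279, and every Start in Part 2 is at or before 7280. A sequence that starts after column 1 and ends before column 14594 fails both screens, which would account for sequences missing from both parts.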

Thank you for your help! I think I misunderstood the command.
So, instead of splitting the alignment, I will modify (cut down in length) my reference alignment before aligning my sequences to it.
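
A minimal sketch of what cutting the reference down to a column range amounts to, assuming the alignment is held as a mapping from sequence name to aligned string. The helper slice_alignment and the toy sequences are hypothetical, for illustration only.

```python
from typing import Dict


def slice_alignment(records: Dict[str, str],
                    first_col: int, last_col: int) -> Dict[str, str]:
    """Keep only alignment columns first_col..last_col (1-based, inclusive).

    Hypothetical helper: every aligned sequence is cut at the same
    columns, so the result is still a valid alignment.
    """
    return {name: seq[first_col - 1:last_col] for name, seq in records.items()}


# Toy reference alignment with 9 columns ('.' and '-' are gap characters).
reference = {
    "refA": "..AC-GT..",
    "refB": ".ACG-TT-.",
}

left = slice_alignment(reference, 1, 4)    # columns 1..4
right = slice_alignment(reference, 5, 9)   # columns 5..9

for name in reference:
    print(name, left[name], right[name])
```

mothur's pcr.seqs command also takes start and end options for trimming an alignment to a region, so it may be the more direct route; it is worth checking its documentation before hand-rolling the slicing.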