Hi,
I would like to know how the “start” and “end” options in screen.seqs work. I have about 4.5M sequences and, because of my data structure, I have to split the data set (alignment) after aligning my sequences to my customised reference alignment. With gaps, this combined alignment is about 14k positions long, so I split it at around 7k for a first look at how many sequences would be in each part.
The first part retains roughly 400k of the 4.5M sequences and the second part roughly 1.8M, so the two split alignments together contain only about 2.2M.
I have two basic questions:
1.) Where do the 2M sequences disappear to?
2.) Looking at the summary files, there are still plenty of sequences that extend past the indicated “end” in part one (and sequences that start before the indicated “start” in part two). How is that possible?
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1, end=7279, processors=16)
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=7280, end=14594, processors=16)
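For context, the mothur documentation describes start as the position a kept sequence must start by, and end as the position a kept sequence must end after. The following is a minimal Python sketch of that rule, not mothur's actual code; the function name passes_screen and the example coordinates are hypothetical and only illustrate how the two commands above would treat different sequences under that reading.

```python
def passes_screen(seq_start, seq_end, start=None, end=None):
    """Return True if a sequence would be kept under the start/end rule sketched here."""
    if start is not None and seq_start > start:
        return False  # sequence begins after the required start position
    if end is not None and seq_end < end:
        return False  # sequence finishes before the required end position
    return True


# Hypothetical alignment coordinates (start, end); not taken from the real data set.
examples = {
    "spans_first_half": (1, 8236),
    "spans_second_half": (2561, 14594),
    "internal_fragment": (2561, 12304),
}

for name, (s, e) in examples.items():
    kept_part1 = passes_screen(s, e, start=1, end=7279)
    kept_part2 = passes_screen(s, e, start=7280, end=14594)
    print(name, kept_part1, kept_part2)
```

Under this reading, a sequence that starts after position 1 and also ends before position 14594 would be rejected by both commands, and a kept sequence may extend well beyond the end value because end acts as a minimum, not a cut-off.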
Here are the resulting summaries for the two parts:
Part 1
             Start   End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     1       7413     208      0       3        1
2.5%-tile:   1       7435     292      0       5        11935
25%-tile:    1       8236     379      0       6        119349
Median:      1       11458    441      0       6        238697
75%-tile:    1       12304    468      0       6        358045
97.5%-tile:  1       12531    479      0       8        465459
Maximum:     1       14594    500      0       250      477393
Mean:        1       10459.5  418.7    0       6.05531
# of unique seqs: 450707
total # of seqs: 477393
Part 2
             Start   End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     1       14594    275      0       3        1
2.5%-tile:   2107    14594    300      0       5        49390
25%-tile:    2449    14594    409      0       6        493895
Median:      2561    14594    459      0       6        987789
75%-tile:    4104    14594    473      0       6        1481683
97.5%-tile:  6306    14594    482      0       8        1926187
Maximum:     6937    14594    500      0       210      1975576
Mean:        3455.8  14594    432.702  0       6.16803
# of unique seqs: 1806487
total # of seqs: 1975576
What does screen.seqs do? Am I using the wrong function for what I want to do?
Kind regards,
Karin