Pre.cluster removes more sequences than there were to start


I am running pre.cluster on v1.42.0, and the screen output during processing reports removing more sequences than it says are present in the sample.

Processing group BH1bkm:
BH1bkm 15681 -16455 32136
Total number of sequences before pre.cluster was 15681.
pre.cluster removed 32136 sequences.

It took 2 secs to cluster 15681 sequences.

Can someone explain what is happening here? The process seems to work fine. I just find these status updates confusing. Thanks.


That’s weird and shouldn’t be happening. Can you do two things for us? Can you try with 1.42.3 and post the actual command you were running? We did some tweaks to pre.cluster over the last 3 minor releases and that may be the issue.


Run today on v1.42.3.

pre.cluster(fasta=iu_choneArchuniquealigngoodfilter.unique.good.fasta, count=iu_choneArchuniquealigngoodfilter.good.count_table, diffs=4, processors=6)

Partial output from the logfile:
Processing group HF4dpn:
LF2bkpn 32352 -46753 79105
Total number of sequences before pre.cluster was 32352.
pre.cluster removed 79105 sequences.

It took 14 secs to cluster 32352 sequences.

Processing group LF3bkm:
BH1bkpn 22640 -28136 50776
Total number of sequences before pre.cluster was 22640.
pre.cluster removed 50776 sequences.

It took 9 secs to cluster 22640 sequences.

Processing group BH1f:
LF3bkm 5379 -3117 8496
Total number of sequences before pre.cluster was 5379.
pre.cluster removed 8496 sequences.

It took 1 secs to cluster 5379 sequences.


It seems that it is just reversing the ‘numbers before pre.cluster’ and ‘numbers removed’ in the logfile, which makes more sense than removing more sequences than there are to remove. Also, the “Processing group X” line often doesn’t match the group name in the stats on the next line, which I assume is just due to overlapping output from the 6 processors. Again, the process seems to work fine.
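
For anyone who wants to check the arithmetic in those status lines, here is a minimal Python sketch. It is not part of mothur. The function names `parse_status_line` and `check_line` are made up for illustration, and reading the three numbers as (reported total, middle value, reported removed) is an assumption based only on the log excerpts above.

```python
def parse_status_line(line):
    """Split a line such as 'BH1bkm 15681 -16455 32136' into its fields."""
    group, first, middle, last = line.split()
    return group, int(first), int(middle), int(last)


def check_line(line):
    """Describe how the three numbers on a status line relate to each other."""
    group, first, middle, last = parse_status_line(line)
    return {
        "group": group,
        # Is the middle column simply first minus last?
        "middle_is_first_minus_last": middle == first - last,
        # Sequences left over IF the two labels are swapped (an assumption).
        "remaining_if_swapped": last - first,
    }


# Status lines copied from the log excerpts in this thread.
lines = [
    "BH1bkm 15681 -16455 32136",
    "LF2bkpn 32352 -46753 79105",
    "BH1bkpn 22640 -28136 50776",
    "LF3bkm 5379 -3117 8496",
]

results = [check_line(line) for line in lines]
for result in results:
    print(result)
```

In all four lines the middle value equals the first number minus the last, which is consistent with the two labels being swapped. It does not by itself confirm which number is the true total.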
