Names file vs count_table after pre.cluster()

Re-posting in case this went unnoticed.

I’m getting a discrepancy when counting the sequences after pre.cluster when using a count_table vs a names file.
What could be the cause?

Before running pre.cluster()
The count_table and names files have exactly the same # of sequences:

summary.seqs(fasta=XXX.fasta, name=XXX.names)

# of unique seqs: 27096

total # of seqs: 348502

summary.seqs(fasta=XXX.fasta, count=XXX.count_table)

# of unique seqs: 27096

total # of seqs: 348502

After running pre.cluster():

pre.cluster(fasta=XXX.fasta, count=XXX.count_table, diffs=3)

I use the new fasta and count_table files:

summary.seqs(fasta=XXX.precluster.fasta, count=XXX.precluster.count_table)

and get:

# of unique seqs: 9165

total # of seqs: 348502

Or the new fasta file with the old names file (I have no new names file):

summary.seqs(fasta=XXX.precluster.fasta, name=XXX.names)

and get:

# of unique seqs: 9165

total # of seqs: 281982

What’s happening? pre.cluster() isn’t supposed to reduce the total number of sequences, only the number of unique sequences.

Using mothur v.1.32.0

Thanks!

Could you send your fasta, name and count file to mothur.bugs@gmail.com?

The mismatch in the totals is not a bug. You are using the preclustered fasta file with a names file that was not preclustered. In the count file, when each sequence was merged with its representative, its counts were added to the representative's counts, which preserves the overall sequence total. In the names file, no merging was done. When summary.seqs reads the names file, the sequences that pre.cluster merged are ignored because they are not in the preclustered fasta file, so the reads listed under them drop out of the total. To compare the files you would want to run:

summary.seqs(fasta=yourFasta, name=yourName)
summary.seqs(fasta=yourFasta, count=yourCount)
pre.cluster(fasta=yourFasta, count=yourCount, diffs=3)
pre.cluster(fasta=yourFasta, name=yourName, group=yourGroup, diffs=3)
summary.seqs(fasta=yourFasta, name=yourName)
summary.seqs(fasta=yourFasta, count=yourCount)
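
The bookkeeping described in the reply can be illustrated with a small Python sketch. This is not mothur's code, only a toy model of the explanation above: the sequence names, the `merged_into` mapping, and both helper functions are made up for illustration, and the "names file" is reduced to a dictionary from each unique sequence to the reads it represents.

```python
# Toy model of the explanation above (not mothur's actual implementation).
# Names-file style data before pre.cluster: unique sequence -> reads it represents.
names = {
    "seqA": ["seqA", "seqA2", "seqA3"],  # 3 reads
    "seqB": ["seqB", "seqB2"],           # 2 reads
    "seqC": ["seqC"],                    # 1 read
}

# Count-table style data built from the same information.
counts = {rep: len(members) for rep, members in names.items()}

# Hypothetical pre.cluster outcome: seqB and seqC are merged into seqA.
merged_into = {"seqB": "seqA", "seqC": "seqA"}


def precluster_counts(counts, merged_into):
    """Add the counts of each merged sequence to its representative."""
    new_counts = {}
    for rep, n in counts.items():
        target = merged_into.get(rep, rep)
        new_counts[target] = new_counts.get(target, 0) + n
    return new_counts


def total_from_names(names, fasta_ids):
    """Sum reads only for names-file entries whose unique sequence is in the fasta."""
    return sum(len(members) for rep, members in names.items() if rep in fasta_ids)


new_counts = precluster_counts(counts, merged_into)
precluster_fasta_ids = set(new_counts)  # only representatives remain in the fasta

total_count = sum(new_counts.values())
total_names = total_from_names(names, precluster_fasta_ids)

print(total_count)  # 6: the preclustered count table keeps every read
print(total_names)  # 3: the old names file loses the reads of seqB and seqC
```

In the numbers from the original post, the same effect would account for the gap of 348502 - 281982 = 66520 reads, which under this explanation belong to the 27096 - 9165 = 17931 unique sequences that were merged and are no longer in the preclustered fasta file.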