Trouble keeping both a names file and a count_table up to date

The MiSeq SOP tends to work primarily with count_table (which I guess replaces the old .groups file), rather than names file, so as to keep information about different samples.

Many functions, however (e.g. unique.seqs, split.abund, pre.cluster), accept either a names file or a count_table, but not both. The result is that after a few steps of sequence cleaning with a count_table, the names file is no longer up to date, and trying to use it gives an error. The trouble is that there’s no way to convert between the two (names into count_table and vice versa).
For most analyses just having the count_table file is fine, but for some external tools I need to work with a de-uniqued fasta file. However, the function deunique.seqs only accepts a names file…

Any solution to the problem?

I tried running functions in parallel (once with a names file and once with a count_table) but ran into discrepancies after a while.

Thanks for the feature request! We will add the count table to the deunique.seqs command for the next release of mothur.
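Until then, one possible stopgap is to expand the fasta file outside mothur using the totals in the count_table. The following is a minimal Python sketch, not part of mothur. It assumes the count_table is tab-separated with a header row, the representative name in the first column and the total in the second. Because the count_table does not store the original read names, the sketch invents names of the form name_1, name_2, and so on. The function names (read_count_totals, read_fasta, deunique) are hypothetical.

```python
def read_count_totals(handle):
    """Return {representative_name: total} from a count_table.

    Assumptions: tab-separated, first non-comment line is the header,
    column 1 is the representative name, column 2 is the total.
    Lines starting with '#' are treated as comments.
    """
    totals = {}
    header_seen = False
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if not header_seen:
            header_seen = True  # skip the header row
            continue
        fields = line.split("\t")
        totals[fields[0]] = int(fields[1])
    return totals


def read_fasta(handle):
    """Yield (name, sequence) pairs; multi-line sequences are joined."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:].split()[0]  # keep only the first token of the header
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def deunique(fasta_handle, count_handle, out_handle):
    """Write each unique sequence `total` times; return records written."""
    totals = read_count_totals(count_handle)
    written = 0
    for name, seq in read_fasta(fasta_handle):
        for i in range(1, totals[name] + 1):
            # Hypothetical naming scheme: the original read names are not recoverable.
            out_handle.write(f">{name}_{i}\n{seq}\n")
            written += 1
    return written
```

Called with open file handles for the fasta file, the matching count_table and an output file, deunique returns the number of records written, which should equal the total number of sequences that summary.seqs reports for the same fasta and count_table pair.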

Thanks a lot for considering my needs.
I was wondering if there’s a workaround in the meantime.
I’ve noticed that things start to diverge from the point of running pre.cluster(). The function behaves differently when given a name file vs a count_table. If I run it with count_table:

pre.cluster(fasta=xxx, count=xxx, diffs=3)

Then running summary.seqs() gives different results with the names file than with the new count_table:

summary.seqs(fasta=XXX.precluster.fasta, count=XXX.precluster.count_table)

gives:

# of unique seqs: 9165

total # of seqs: 348502

but:

summary.seqs(fasta=XXX.precluster.fasta, name=XXX.names)

gives:

# of unique seqs: 9165

total # of seqs: 281982

Why is that, and how can I work around it?

Thanks again for all the support.

Hmmm… That looks odd. Are you sure your names and count files have the same number of lines? Can you try running the count.groups command on the count file and on the names and groups files?
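Besides count.groups, a quick independent check is to compare the per-representative totals in the two files directly. Here is a minimal Python sketch, assuming a two-column names file (representative, then a comma-separated list of reads) and a tab-separated count_table with a header row and the total in the second column. The function names are hypothetical.

```python
def names_totals(handle):
    """Map each representative in a names file to its number of reads."""
    totals = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rep, members = fields
        totals[rep] = len(members.split(","))
    return totals


def count_totals(handle):
    """Map each representative in a count_table to its 'total' column."""
    totals = {}
    lines = (ln for ln in handle if ln.strip())
    next(lines, None)  # skip the header row
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        totals[fields[0]] = int(fields[1])
    return totals


def compare(names, counts):
    """Return (only in names, only in counts, present in both but differing)."""
    only_names = sorted(set(names) - set(counts))
    only_counts = sorted(set(counts) - set(names))
    differing = sorted(n for n in set(names) & set(counts) if names[n] != counts[n])
    return only_names, only_counts, differing
```

If all three returned lists are empty, the two files describe the same set of representatives with the same abundances.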

Before running pre.cluster()
The count_table and names files have exactly the same # of sequences:

summary.seqs(fasta=XXX.fasta, name=XXX.names)

# of unique seqs: 27096

total # of seqs: 348502

summary.seqs(fasta=XXX.fasta, count=XXX.count_table)

# of unique seqs: 27096

total # of seqs: 348502

After running pre.cluster():

pre.cluster(fasta=XXX.fasta, count=XXX.count_table, diffs=3)

I use the new fasta and count_table files:

summary.seqs(fasta=XXX.precluster.fasta, count=XXX.precluster.count_table)

and get:

# of unique seqs: 9165

total # of seqs: 348502

Or the new fasta file with the old names file (I have no new names file):

summary.seqs(fasta=XXX.precluster.fasta, name=XXX.names)

and get:

# of unique seqs: 9165

total # of seqs: 281982

What’s happening? pre.cluster() isn’t supposed to reduce the number of sequences, only the number of unique sequences.

Using mothur v.1.32.0

Thanks!
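One possible explanation, offered as an assumption rather than something confirmed in this thread: the old names file still lists each original representative with only its own reads, so when it is paired with the pre.cluster fasta file, the reads of representatives that pre.cluster merged away are no longer counted. That would be consistent with the gap of 348502 - 281982 = 66520 reads. The toy Python sketch below uses made-up data and a hypothetical helper to show the effect.

```python
def total_reads(fasta_names, reads_per_rep):
    """Sum reads over the representatives that are present in the fasta file."""
    return sum(reads_per_rep[name] for name in fasta_names)


# Toy data (hypothetical): three unique sequences before pre.cluster.
old_names = {"r1": 3, "r2": 2, "r3": 1}   # reads per representative, old names file

# Suppose pre.cluster merges r3 into r1, so the new count_table credits r1 with 4 reads.
new_counts = {"r1": 4, "r2": 2}
precluster_fasta = ["r1", "r2"]           # r3 no longer appears in the new fasta

with_count = total_reads(precluster_fasta, new_counts)
with_old_names = total_reads(precluster_fasta, old_names)

print(with_count, with_old_names)  # prints: 6 5
```

In this toy case the count_table keeps all 6 reads, while the stale names file accounts for only 5, because the read that belonged to r3 is not listed under r1.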

Hi,
I have a similar problem with the pre.cluster command (mothur v.1.33).
I ran pre.cluster 3 times with the same input fasta file and corresponding 1) count_table, 2) name file and 3) name and group file.
Each time the total number of sequences stayed the same (as expected), but the number of unique sequences varied dramatically (several thousands) between the three runs.
Shouldn’t the number of sequences that get ‘merged’ into unique sequences stay the same for the same input, regardless of which group (i.e. sample) the sequences are from?
If the group information does matter during pre.cluster, why is there still a difference in the output between option 1 and 3?
Does it matter that I split the job across several processors for 3?

pre.cluster(fasta=All.unique.good.filter.unique.fasta, count=All.unique.good.filter.count_table, diffs=3)
unique 120858
total 576401

pre.cluster(fasta=All.unique.good.filter.unique.fasta, name=All.unique.good.filter.names, diffs=3)
unique 112858
total 576401

pre.cluster(fasta=All.unique.good.filter.unique.fasta, name=All.unique.good.filter.names, diffs=3, group=All.good.groups, processors=8)
unique 120618
total 576401

Thanks!

When running pre.cluster without groups the names and count files should match. Can you try running the following?

mothur > pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.names, diffs=2)
mothur > summary.seqs(name=current)

# of unique seqs: 5271

total # of seqs: 67746

mothur > make.table(name=GQY1XT001.shhh.trim.unique.good.filter.names)
mothur > pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, count=current, diffs=2)
mothur > summary.seqs(count=current)

# of unique seqs: 5271

total # of seqs: 67746

When running pre.cluster with groups there may be some slight variation. This is caused by the ties in abundance. The pre.cluster command clusters within the groups. It sorts the sequences by abundance and in the case of ties by sequence name. When using the names file, mothur uses the names of the redundant sequences from the names file. The count file only has unique sequence names, so the order of ties in abundance can be slightly different.

mothur > pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.names, group=GQY1XT001.shhh.good.groups, diffs=2)
mothur > summary.seqs(name=current)

# of unique seqs: 6804

total # of seqs: 67746

mothur > make.table(name=GQY1XT001.shhh.trim.unique.good.filter.names, group=GQY1XT001.shhh.good.groups)
mothur > pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, count=current, diffs=2)
mothur > summary.seqs(count=current)

# of unique seqs: 6782

total # of seqs: 67746
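
To see why the order of ties can change the number of unique sequences, here is a deliberately simplified Python sketch of greedy pre-clustering. It is not mothur's implementation: it ignores groups and alignment details, uses plain Hamming distance on equal-length strings, and the function names are hypothetical. Three sequences of equal abundance are clustered with diffs=1 in two different tie orders.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_precluster(ordered, diffs):
    """Greedy clustering over (name, sequence) pairs.

    `ordered` is assumed to be sorted by abundance already, with ties in
    whatever order the caller chose. Each unmerged sequence becomes a
    center and absorbs every later unmerged sequence within `diffs`.
    Returns the names of the surviving centers.
    """
    merged = set()
    centers = []
    for i, (name, seq) in enumerate(ordered):
        if name in merged:
            continue
        centers.append(name)
        for other_name, other_seq in ordered[i + 1:]:
            if other_name not in merged and hamming(seq, other_seq) <= diffs:
                merged.add(other_name)
    return centers


# Toy sequences, all with the same abundance: A-B and B-C differ by 1, A-C by 2.
A = ("A", "AAAA")
B = ("B", "AAAT")
C = ("C", "AATT")

print(greedy_precluster([A, B, C], diffs=1))  # prints: ['A', 'C']
print(greedy_precluster([B, A, C], diffs=1))  # prints: ['B']
```

With A first, C is too far from A and stays unique. With B first, both A and C fall within one difference of B and are merged, so the same input yields one unique sequence instead of two.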