Pre.cluster output file question: names and count_table

When I run this:

pre.cluster(fasta = twist.good.unique.fasta, name = twist.good.names, group = twist.good.groups)

My output includes an updated count_table but no updated names file. I expected the names file to get updated, rather than the count_table.

How can I update my names file? I know that if I have a names file, I can use count.seqs to produce an updated count_table file. I’m not sure how to get an updated names file from my count_table.

Thank you!

Additional excerpts from the logfile, demonstrating my difficulty (I used ** to highlight the problem areas):

mothur > pre.cluster(fasta = twist.good.unique.fasta, name = twist.good.names, group = twist.good.groups)

Output File Names:
twist.good.count_table
twist.good.unique.precluster.fasta
twist.good.unique.precluster.count_table
(plus the twist.good.unique.precluster.{sample}.{primer}.map files) [*I summarized this part*]

chimera.vsearch(fasta=twist.good.unique.precluster.fasta, 
count=twist.good.unique.precluster.count_table, dereplicate='t', vsearch=r'~/bin/vsearch')

Output File Names:
twist.good.unique.precluster.denovo.vsearch.pick.count_table
twist.good.unique.precluster.denovo.vsearch.chimeras
twist.good.unique.precluster.denovo.vsearch.accnos
twist.good.unique.precluster.{sample}.{primer}.count_table files [*I summarized this part*]
twist.good.unique.precluster.{sample}.{primer}.fasta files [*I summarized this part*]

mothur > get.current()

Current files saved by mothur:
accnos=twist.good.unique.precluster.denovo.vsearch.accnos
fasta=twist.good.unique.precluster.fasta
group=twist.good.groups
**name=twist.good.count_table**
count=twist.good.unique.precluster.denovo.vsearch.pick.count_table
processors=72
summary=twist.good.unique.summary

mothur > cluster(count='current', method='unique', cutoff='unique')
Using twist.good.unique.precluster.denovo.vsearch.pick.count_table as input file for the count parameter.

Output File Names:
twist.good.unique.precluster.denovo.vsearch.pick.unique.list

mothur > remove.rare(list='current', count='current', nseqs=9, label='unique')
Using twist.good.unique.precluster.denovo.vsearch.pick.count_table as input file for the count parameter.
Using twist.good.unique.precluster.denovo.vsearch.pick.unique.list as input file for the list parameter.

Output File Names:
twist.good.unique.precluster.denovo.vsearch.pick.pick.count_table
twist.good.unique.precluster.denovo.vsearch.pick.unique.0.pick.list

mothur > summary.seqs(fasta='current', name='current')
**[ERROR]: Your count file contains 951427 unique sequences, but your fasta file contains 700123. File mismatch detected, quitting command.**

mothur > count.seqs(name='current', group='current')
Using twist.good.groups as input file for the group parameter.
**Using twist.good.count_table as input file for the name parameter.**
[ERROR]: Format_ is not in your groupfile, please correct.
(plus tons of other errors, because the name file it is using is not a name file, it's a count_table...)

We recommend using a count file instead of a names and groups file. The count file represents the same information as the names and groups files, but in a more concise way. This allows commands run with the count file to use less memory and take less time to process. The pre.cluster command is memory- and time-intensive, so if the name and group options are used, mothur converts these files to a count file.

Try this instead:

mothur > cluster(count=current, method=unique) - cluster into ASVs

mothur > remove.rare(list=current, count=current, nseqs=9) - remove rare seqs (abundance at or below nseqs) from the list and count files

mothur > list.seqs(count=current) - list ‘good’ seqs

mothur > get.seqs(fasta=current, accnos=current) - select seqs from fasta file present in the ‘good’ count file

mothur > summary.seqs(fasta=current, count=current) - summarize dataset

Thank you, Sarah, for explaining this. I’m now trying to use count_table in my pipeline instead of the name file. I may still want a way to update the name file at the end though, since it contains the list of identical read IDs for each unique sequence. Is there a way to update the name file to contain the same number of sequences as the count file?

Below is how I have modified the code so far. The only change to your suggestion is I have run unique.seqs after chimera.vsearch to update my name file at that position. It seems to work, until I get to the get.seqs command, where it grabs many more sequences from my name file than I expect (usually, it grabs the same number from both my group and name files).

I start with set.current to set my files to match the output of chimera.vsearch.

mothur > set.current(fasta=twist.good.unique.precluster.pick.fasta, count=twist.good.denovo.vsearch.pick.count_table, name=twist.good.pick.names, group=twist.good.pick.groups)
mothur > unique.seqs(fasta=current)
mothur > cluster(count=current, method=unique, cutoff=unique)
mothur > remove.rare(list=current, count=current, nseqs=9, label=unique)
mothur > list.seqs(count=current)
mothur > get.seqs(fasta=current, accnos=current, name=current, group=current)
Using twist.good.denovo.vsearch.pick.pick.accnos as input file for the accnos parameter.
Using twist.good.unique.precluster.pick.unique.fasta as input file for the fasta parameter.
Using twist.good.pick.groups as input file for the group parameter.
Using twist.good.unique.precluster.pick.names as input file for the name parameter.
Selected 9641854 sequences from your name file.
Selected 7827 sequences from your fasta file.
Selected 60382 sequences from your group file.

So then when I run summary.seqs(fasta=current, name=current), I get output that suggests I have 7,827 unique sequences and 9,661,854 total sequences. My count file has 60,382 unique sequences (and I can’t run summary.seqs(fasta=current, count=current) because of this file mismatch).

I’m sorry to take so much of your time. This was working perfectly before I added the pre.cluster command to my pipeline, and that suggests to me that I really don’t understand what’s going on as much as I thought I did! I really appreciate your help (and I also really appreciate how well-written and well-maintained Mothur is…it’s much easier to troubleshoot than most other programs I’ve used!)

We don’t recommend using the count file and the name / group files at the same time. It often results in file mismatches or command crashes. Let me explain with the commands you are running.

Set current file names
mothur > set.current(fasta=twist.good.unique.precluster.pick.fasta, count=twist.good.denovo.vsearch.pick.count_table, name=twist.good.pick.names, group=twist.good.pick.groups)

This will output a new names file and cause a file mismatch, because the previous duplicate reads are replaced by the new duplicates instead of being merged with them.
mothur > unique.seqs(fasta=current)

Instead, try this:

Merge duplicates and combine their counts in the count file

mothur > unique.seqs(fasta=current, count=current)

mothur > cluster(count=current, method=unique, cutoff=unique)
mothur > remove.rare(list=current, count=current, nseqs=9, label=unique)
mothur > list.seqs(count=current)

This is not doing what you think. The name and group file will be missing names.
mothur > get.seqs(fasta=current, accnos=current, name=current, group=current)

Consider this small example:

Fasta File:

seq1

seq2

seq4

seq7

Name file:

seq1 seq1,seq3,seq5
seq2 seq2,seq6
seq4 seq4
seq7 seq7,seq8

Group File:

seq1 group1
seq2 group2
seq3 group1
seq4 group2
seq5 group3
seq6 group3
seq7 group1
seq8 group2

Count File:

Representative_Sequence total group1 group2 group3
seq1 3 2 0 1
seq2 2 0 1 1
seq4 1 0 1 0
seq7 2 1 1 0
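
To see why the count file carries the same information as the name and group files, here is a minimal Python sketch that rebuilds the count file above from the example name and group files. It is an illustration only, not mothur's own code, and `build_count_table` is a hypothetical helper.

```python
from collections import Counter

# Example name file: representative -> all reads it stands for
names = {
    "seq1": ["seq1", "seq3", "seq5"],
    "seq2": ["seq2", "seq6"],
    "seq4": ["seq4"],
    "seq7": ["seq7", "seq8"],
}

# Example group file: read -> group
groups = {
    "seq1": "group1", "seq2": "group2", "seq3": "group1", "seq4": "group2",
    "seq5": "group3", "seq6": "group3", "seq7": "group1", "seq8": "group2",
}

def build_count_table(names, groups):
    """Return {representative: (total, {group: count})}."""
    table = {}
    for rep, reads in names.items():
        per_group = Counter(groups[r] for r in reads)
        table[rep] = (len(reads), dict(per_group))
    return table

count_table = build_count_table(names, groups)
all_groups = sorted(set(groups.values()))

# Print in the same layout as the count file above
print("Representative_Sequence", "total", *all_groups)
for rep, (total, per_group) in count_table.items():
    print(rep, total, *(per_group.get(g, 0) for g in all_groups))
```

The per-read names (seq3, seq5, and so on) are the only thing lost in the conversion; the totals and the per-group breakdown survive.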

Run unique.seqs command - assume seq2 and seq4 are identical

mothur > unique.seqs(fasta=current)

Fasta File:

seq1

seq2

seq7

Name file:

seq1 seq1
seq2 seq2,seq4
seq7 seq7

The count and group files are unchanged, so they no longer match the new fasta and name files.

Now run the cluster command:

mothur > cluster(count=current, method=unique, cutoff=unique)

List file:

label numOtus Otu1 Otu2 Otu3 Otu4
asv 4 seq1 seq2 seq4 seq7

Now let’s run remove.rare setting nseqs=2 due to the sample size:

mothur > remove.rare(list=current, count=current, nseqs=2)

List file:

label numOtus Otu1
asv 1 seq1

Count File:

Representative_Sequence total group1 group3
seq1 3 2 1

The accnos file from list.seqs will contain only seq1. Now run get.seqs:

mothur > get.seqs(fasta=current, accnos=current, name=current, group=current)

Name file:

seq1 seq1

Group File:

seq1 group1

Fasta File:

seq1

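But seq1 should represent 3 sequences and 2 groups, as it does in the count file. The name and group files now say it represents 1 sequence in 1 group.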
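
The mismatch can be made concrete with a small Python sketch. It only mimics the selection step on the example data (keep the entries whose first column is listed in the accnos file); it is not how mothur implements get.seqs, and the variable names are made up for illustration.

```python
# Name file as rewritten by unique.seqs(fasta=current): the old duplicates
# of seq1 (seq3, seq5) are gone. The seq7 entry is omitted here because it
# is not selected either way.
name_file = {"seq1": ["seq1"], "seq2": ["seq2", "seq4"]}

# Count file after remove.rare(nseqs=2): representative -> total
count_file = {"seq1": 3}

# accnos file from list.seqs(count=current)
accnos = {"seq1"}

# Keep only the representatives named in the accnos file
picked_names = {rep: reads for rep, reads in name_file.items() if rep in accnos}

# Compare how many reads each file thinks the representative stands for
for rep, reads in picked_names.items():
    in_names = len(reads)
    in_count = count_file[rep]
    status = "OK" if in_names == in_count else "MISMATCH"
    print(f"{rep}: name file={in_names}, count file={in_count} -> {status}")
```

A check like this, run on real files, is one way to spot the problem before summary.seqs reports it.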

To your question about converting a count file to name and group files: you can use the deunique.seqs and unique.seqs commands. You won’t get the original sequence names, but something like this:

mothur > deunique.seqs(count=current, fasta=current) - create a fasta and group file
mothur > unique.seqs(fasta=current) - combine duplicates to create name file (seq2 and seq4 are identical)

Count File:

Representative_Sequence total group1 group2 group3
seq1 3 2 0 1
seq2 2 0 1 1
seq4 1 0 1 0
seq7 2 1 1 0

Becomes this name file:

seq1_1 seq1_1,seq1_2,seq1_3
seq2_1 seq2_1,seq2_2,seq4_1
seq7_1 seq7_1,seq7_2
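
The renaming and merging above can be sketched in Python. This is an illustration under assumptions, not mothur's code: the `_1`, `_2` suffix scheme is copied from the example output, the group columns are left out, and the toy sequences are invented so that seq2 and seq4 are identical.

```python
# Count file totals per representative (group columns omitted for brevity)
totals = {"seq1": 3, "seq2": 2, "seq4": 1, "seq7": 2}

# Toy sequences (invented for this sketch); seq2 and seq4 are identical
fasta = {"seq1": "ACGT", "seq2": "AGGT", "seq4": "AGGT", "seq7": "TTGA"}

# Step 1 (deunique.seqs-like): expand each representative into numbered reads
expanded = {}
for rep, total in totals.items():
    for i in range(1, total + 1):
        expanded[f"{rep}_{i}"] = fasta[rep]

# Step 2 (unique.seqs-like): collect reads that share the same sequence;
# the first read seen becomes the representative
by_sequence = {}
for read, seq in expanded.items():
    by_sequence.setdefault(seq, []).append(read)

name_file = {reads[0]: reads for reads in by_sequence.values()}
for rep, reads in name_file.items():
    print(rep, ",".join(reads))
```

The original read IDs (seq3, seq5, seq6, seq8) cannot be recovered this way, because the count file never stored them.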


I see it better now! Thank you so much. I’ll avoid using the name file. So far, just having the abundance number is all we need. If we need the read IDs for those identical sequences, we’ll have to address that down the road. One other thing I’ve noticed about the count file and group file (in our particular case, where we’re using Mothur on highly-multiplexed amplicon sequencing data with hundreds of primers instead of just one) is that the count file contains only the group (in our case, sample) info. Our group file splits on both sample and primer pair, which is exactly what we want (and why we’re using Mothur in the first place!) So if I use deunique.seqs to re-create the name and group files from the count file, I think it will create a group file that splits on sample only.

Anyways, this is a perfect explanation and I really appreciate how much effort you took to create it. You guys are really the best! I’ll stick with count file only; I’m 99% certain we only need the abundance info that is contained in the count file anyways. Thank you!

The deunique.seqs command will create a group file with the same group names as in the count file. If you used an oligos file to assign sequences to groups, the group names can include the primer and barcode names. Something like:

primer xxx xxx v3
primer xxx xxx v4
primer xxx xxx v5
barcode xxxxx xxxxx sample1
barcode xxxxx xxxxx sample2
...
barcode xxxxx xxxxx sampleN

Will create names like: v4.sample1, v3.sample1, v5.sample1, v4.sample2… v5.sampleN

If you don’t provide a primer name, the primers are removed and the sample names will be the same as the barcode names.

primer xxx xxx
primer xxx xxx
primer xxx xxx
barcode xxxxx xxxxx sample1
barcode xxxxx xxxxx sample2
...
barcode xxxxx xxxxx sampleN

Will create names like: sample1, sample2… sampleN
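
As a rough illustration of the naming pattern (inferred only from the examples above, not from mothur's source), the group names can be thought of as every primer name joined to every barcode name with a dot, or just the barcode name when the primers are unnamed. `group_names` is a hypothetical helper, and the order of the generated names is arbitrary.

```python
def group_names(primer_names, barcode_names):
    """Combine primer and barcode names the way the examples above suggest."""
    named = [p for p in primer_names if p]  # ignore primers without a name
    if not named:
        # No primer names: groups are named after the barcodes alone
        return list(barcode_names)
    return [f"{p}.{b}" for b in barcode_names for p in named]

# Named primers: one group per primer/barcode combination
print(group_names(["v3", "v4", "v5"], ["sample1", "sample2"]))

# Unnamed primers: one group per barcode
print(group_names([None, None, None], ["sample1", "sample2"]))
```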


Is there a way to check that pre.cluster() is doing what we expect it to do? For example, when one specific sequence is merged into a similar sequence (1 SNP), we want to know the names of both sequences so we can check on them. (I guess the information is all in the ‘before and after’ count_table files? But it’s a bit hard to check, given that our count_table file is huge.)

previous files:
sal.good.pick.names
sal.good.pick.groups
sal.good.unique.pick.fasta

so I’m running

count.seqs(name='current', group='current')

Output File Names:
sal.good.pick.count_table

pre.cluster(fasta='current', count='current')

Output File Names:
sal.good.unique.pick.precluster.count_table
sal.good.unique.pick.precluster.fasta
and all the *.map files

Judging from the line counts of the two count_tables (400K vs. 600K), roughly 33% of the sequences were merged. We just want to make sure it’s doing the right thing.

Thanks!

Check out the map files.
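
Alongside the map files, one quick sanity check is to diff the two count tables: any representative present before pre.cluster but absent afterwards was merged into another sequence. A minimal Python sketch follows. It assumes the plain whitespace-delimited count_table layout shown earlier in this thread (a header line, then the sequence name in the first column) and would need adapting for other layouts; the function names are hypothetical. It tells you which sequences were absorbed, not which sequence absorbed them; for that, look in the map files.

```python
def read_count_names(lines):
    """Return the set of sequence names in a count_table given as lines."""
    names = set()
    for line in lines:
        line = line.strip()
        # Skip blanks, comment lines, and the column header
        if not line or line.startswith("#") or line.startswith("Representative_Sequence"):
            continue
        names.add(line.split()[0])
    return names

def merged_away(before_lines, after_lines):
    """Names present before pre.cluster but missing afterwards."""
    return read_count_names(before_lines) - read_count_names(after_lines)

# Tiny in-memory example. With real data, pass open file handles instead, e.g.
# merged_away(open("sal.good.pick.count_table"),
#             open("sal.good.unique.pick.precluster.count_table"))
before = ["Representative_Sequence\ttotal\tg1", "seqA\t5\t5", "seqB\t1\t1"]
after = ["Representative_Sequence\ttotal\tg1", "seqA\t6\t6"]
print(sorted(merged_away(before, after)))
```

Because the files are read line by line and only the names are kept, this should stay manageable even for count tables with several hundred thousand rows.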