count table contains more than 1 sequence

fabianr · February 18, 2016, 3:05pm

After using the align.seqs command to align my sequences to the silva database, I use the screen.seqs command to filter the aligned sequences and provide the countable to get an updated sequence count.

The updated sequences count, however contains duplicate sequence names wherefore the unique.seqs command fails with “[ERROR]: Your count table contains more than 1 sequence named M01867_162_000000000-AFMW9_1_2119_6612_17173, sequence names must be unique. Please correct.”

I checked and it does contain this sequence twice (both singletons), but removing one copy just postponed the error to the next duplicated sequence. I tried running screen.seqs command on a single core but the error was the same.

I aligned the sequences to the silva database that I prepared as described in the tutorial (by predicting the e.coli fragment with the primers that I used, 341F and 805R, then aligning the fragment (without primers) to the silva.bacteria.fasta file and using the aligned sequence in pcr.seqs command.)

Ps.: (and unrelated to the question) I’m aware that my sequences do not fully overlap and that this can be regarded as a problem. To be honest, I haven’t been involved in the choice of primers. However, these primers are extremely widely used for environmental sequences, mostly because the are supposed to have a small phylogenetic bias and amplify both bacteria and archaea. I don’t want to imply that so many scientists can’t be wrong (sure they can) but It seems like there must be an upside, too? Or why are they so widely spread?

pschloss · February 23, 2016, 11:41am

They’re definitely wrong

Can you post the exact commands you are running and the error message that pops up?

pat

fabianr · February 23, 2016, 1:57pm

Good to know, I’ll tell them when I meet them

here are the exact commands I used

aligning the V3V4 region of Ecoli to silva reference

align.seqs(fasta=Ecoli_V34_exp.fasta, reference=silva.bacteria.fasta)

subsetting the silva reference to relevant region

pcr.seqs(fasta=silva.bacteria.fasta, ecoli=Ecoli_V34_exp.align, keepdots=F, processors=16)
system(mv silva.bacteria.pcr.fasta silva.v4.fasta)

aligning my contigs to silva

align.seqs(fasta=fileList.paired.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)

summary.seqs(fasta=fileList.paired.trim.contigs.good.unique.align, count=fileList.paired.trim.contigs.good.count_table)

filter aligned sequences

screen.seqs(fasta=fileList.paired.trim.contigs.good.unique.align, count=fileList.paired.trim.contigs.good.count_table, summary=fileList.paired.trim.contigs.good.unique.summary, start=2, end=17014, maxhomop=8)

truncate both ends and take out all-gaps

filter.seqs(fasta=fileList.paired.trim.contigs.good.unique.good.align, vertical=T, trump=.)

re-run unique.seq in case that some sequences became identical after truncation

unique.seqs(fasta=fileList.paired.trim.contigs.good.unique.good.filter.fasta, count=fileList.paired.trim.contigs.good.good.count_table)

[ERROR]: Your count table contains more than 1 sequence named M01867_162_000000000-AFMW9_1_2119_6612_17173, sequence names must be unique. Please correct.

The E.Coli sequence that I used for the V3V4 loop covered by my primers is taken from ncbi (primer blast against Ecoli 16S, extract by me). In case its relevant, here it is:

gi|174375|gb|J01859.1|ECORRD E.coli 16S ribosomal RNA V34 341F 805R
GTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGT
TGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAA
GAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTG
GGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCAT
CTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGA
TCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGG
AGCAAACAGG

pschloss · February 25, 2016, 12:40pm

How did you generate the count_table file?

fabianr · February 29, 2016, 8:56am

The full code that I used for the analysis up to this point reads like this:

creating the input files for make.contigs and redirect in output folder

make.file(input=./Trimmed_reads, type=fastq)

join paired reads and redirect in output folder

make.contigs(file=./make_file_out/fileList.paired.file, processors=16)

summarise

summary.seqs(fasta=fileList.paired.trim.contigs.fasta, processors=16)

remove all sequences longer than 430 bp, shorter than 416 bp and with any ambiguous base-calls

screen.seqs(fasta=fileList.paired.trim.contigs.fasta, summary=fileList.paired.trim.contigs.summary, maxambig=0, minlength=416, maxlength=430, maxhomop=9,  processors=16)"

summarise

summary.seqs(fasta=fileList.paired.trim.contigs.good.fasta, processors=16)

collapse unique seqeunces

unique.seqs(fasta=fileList.paired.trim.contigs.good.fasta)

count sequences

count.seqs(name=fileList.paired.trim.contigs.good.names, processors=16)

summarise

summary.seqs(fasta=fileList.paired.trim.contigs.good.unique.fasta, count=fileList.paired.trim.contigs.good.count_table, processors=16)

[… see post above from here …]

westcott · March 8, 2016, 1:43pm

In the file generated by the make.file command are there any duplicate filenames? In the log file, can you see the full path of the files mothur is reading in make.contigs?

westcott · March 9, 2016, 9:21pm

The file I am referring to is ./make_file_out/fileList.paired.file.

yangfa1 · March 10, 2016, 12:46pm

I have the same duplicated sequence (just one in a big file) in count_table file that was generated by screen.seqs.

fabianr · March 21, 2016, 7:49pm

just a small update: I re-run the whole pipeline up to this point (checking that there were no duplicate file names in make file) and received the same error. However checking with

sort FILE | uniq -cd

for which sequences were duplicated in the countfile, I only got one sequence (Incidentally? the very last) which I could remove with

sed -i.bak '/LINENUMBER/d' ./FILE

(ht stack overflow for both). NOTE: both are system commands and do not run in mothur as is. So either they must be called in the bash shell or called from within mothur with the system() command (I believe)

After that the pre.cluster command run without problems.

given that it was only a single sequence and only present once, it feels save to remove it. It’s weird though that it arises in the first place.

thanks for your help

best

Fabian

westcott · March 22, 2016, 7:27pm

Could there be an extra sequence in the original fastq files? I would recommend just removing the one offending sequence.

Topic		Replies	Views
align.seqs and summary.seqs issue mothur bugs	3	1669	September 27, 2018
Recurring "Your count table contains more than 1 sequence named..." yet doesn't mothur bugs	12	314	July 16, 2023
summary.seqs error - count table not unique Commands in mothur	2	1492	January 12, 2016
Unique names issue	2	240	October 18, 2023
Countfile and fasta.align not the same amount of sequences Commands in mothur	2	96	June 21, 2024

count table contains more than 1 sequence

Related topics