count table contains more than 1 sequence

After using the align.seqs command to align my sequences to the silva database, I use the screen.seqs command to filter the aligned sequences and provide the countable to get an updated sequence count.

The updated sequences count, however contains duplicate sequence names wherefore the unique.seqs command fails with “[ERROR]: Your count table contains more than 1 sequence named M01867_162_000000000-AFMW9_1_2119_6612_17173, sequence names must be unique. Please correct.”

I checked and it does contain this sequence twice (both singletons), but removing one copy just postponed the error to the next duplicated sequence. I tried running screen.seqs command on a single core but the error was the same.

I aligned the sequences to the silva database that I prepared as described in the tutorial (by predicting the e.coli fragment with the primers that I used, 341F and 805R, then aligning the fragment (without primers) to the silva.bacteria.fasta file and using the aligned sequence in pcr.seqs command.)

Ps.: (and unrelated to the question) I’m aware that my sequences do not fully overlap and that this can be regarded as a problem. To be honest, I haven’t been involved in the choice of primers. However, these primers are extremely widely used for environmental sequences, mostly because the are supposed to have a small phylogenetic bias and amplify both bacteria and archaea. I don’t want to imply that so many scientists can’t be wrong (sure they can) but It seems like there must be an upside, too? Or why are they so widely spread?

They’re definitely wrong :slight_smile:

Can you post the exact commands you are running and the error message that pops up?

pat

Good to know, I’ll tell them when I meet them :slight_smile:

here are the exact commands I used

aligning the V3V4 region of Ecoli to silva reference

align.seqs(fasta=Ecoli_V34_exp.fasta, reference=silva.bacteria.fasta)

subsetting the silva reference to relevant region

pcr.seqs(fasta=silva.bacteria.fasta, ecoli=Ecoli_V34_exp.align, keepdots=F, processors=16)
system(mv silva.bacteria.pcr.fasta silva.v4.fasta)

aligning my contigs to silva

align.seqs(fasta=fileList.paired.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
summary.seqs(fasta=fileList.paired.trim.contigs.good.unique.align, count=fileList.paired.trim.contigs.good.count_table)

filter aligned sequences

screen.seqs(fasta=fileList.paired.trim.contigs.good.unique.align, count=fileList.paired.trim.contigs.good.count_table, summary=fileList.paired.trim.contigs.good.unique.summary, start=2, end=17014, maxhomop=8)

truncate both ends and take out all-gaps

filter.seqs(fasta=fileList.paired.trim.contigs.good.unique.good.align, vertical=T, trump=.)

re-run unique.seq in case that some sequences became identical after truncation

unique.seqs(fasta=fileList.paired.trim.contigs.good.unique.good.filter.fasta, count=fileList.paired.trim.contigs.good.good.count_table)

[ERROR]: Your count table contains more than 1 sequence named M01867_162_000000000-AFMW9_1_2119_6612_17173, sequence names must be unique. Please correct.

The E.Coli sequence that I used for the V3V4 loop covered by my primers is taken from ncbi (primer blast against Ecoli 16S, extract by me). In case its relevant, here it is:

gi|174375|gb|J01859.1|ECORRD E.coli 16S ribosomal RNA V34 341F 805R
GTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGT
TGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAA
GAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTG
GGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCAT
CTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGA
TCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGG
AGCAAACAGG

How did you generate the count_table file?

The full code that I used for the analysis up to this point reads like this:

creating the input files for make.contigs and redirect in output folder

make.file(input=./Trimmed_reads, type=fastq)

join paired reads and redirect in output folder

make.contigs(file=./make_file_out/fileList.paired.file, processors=16)

summarise

summary.seqs(fasta=fileList.paired.trim.contigs.fasta, processors=16)

remove all sequences longer than 430 bp, shorter than 416 bp and with any ambiguous base-calls

screen.seqs(fasta=fileList.paired.trim.contigs.fasta, summary=fileList.paired.trim.contigs.summary, maxambig=0, minlength=416, maxlength=430, maxhomop=9,  processors=16)"

summarise

summary.seqs(fasta=fileList.paired.trim.contigs.good.fasta, processors=16)

collapse unique seqeunces

unique.seqs(fasta=fileList.paired.trim.contigs.good.fasta)

count sequences

count.seqs(name=fileList.paired.trim.contigs.good.names, processors=16)

summarise

summary.seqs(fasta=fileList.paired.trim.contigs.good.unique.fasta, count=fileList.paired.trim.contigs.good.count_table, processors=16)

[… see post above from here …]

In the file generated by the make.file command are there any duplicate filenames? In the log file, can you see the full path of the files mothur is reading in make.contigs?

The file I am referring to is ./make_file_out/fileList.paired.file.

I have the same duplicated sequence (just one in a big file) in count_table file that was generated by screen.seqs.

just a small update: I re-run the whole pipeline up to this point (checking that there were no duplicate file names in make file) and received the same error. However checking with

sort FILE | uniq -cd

for which sequences were duplicated in the countfile, I only got one sequence (Incidentally? the very last) which I could remove with

sed -i.bak '/LINENUMBER/d' ./FILE

(ht stack overflow for both). NOTE: both are system commands and do not run in mothur as is. So either they must be called in the bash shell or called from within mothur with the system() command (I believe)

After that the pre.cluster command run without problems.

given that it was only a single sequence and only present once, it feels save to remove it. It’s weird though that it arises in the first place.

thanks for your help

best

Fabian

Could there be an extra sequence in the original fastq files? I would recommend just removing the one offending sequence.