creating a .taxonomy file for a customized database

herrlebert · April 11, 2017, 1:40pm

Hello everyone,
I have been using mothur (v.1.38.1) for some time now and want to create a .taxonomy and a .fasta file that can be used with the classify.seqs() command. As a template, I downloaded an NCBI database, which I got here: Accessing the Hidden Kingdom: Fungal ITS Reference Sequences - NCBI Insights. Since I am working with fungal ITS sequences, I had previously done a taxonomic assignment using mothur and the UNITE database, which comes in a suitable format. I have tried to re-create this format as described here: Redirecting…. All the sequences from the NCBI dataset are now in a fasta file that has the following format:

“UNIQUEHEADER1”
“THESEQUENCE1”
“UNIQUEHEADER2”
“THESEQUENCE2”
…
As unique headers, I used GI numbers, meaning each header contains only a unique number and the leading “>”.
I also created the .taxonomy file by extracting the taxonomy from the NCBI files; the format looks like this:

1169078893 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Microascales;f__Microascaceae;g__Wardomycopsis;s__litoralis;
1169078892 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Microascales;f__Microascaceae;g__Wardomyces;s__ovalis;
…
The starting number is the GI number of the corresponding sequence.
The GI number is separated from the taxonomy by a tab (“\t”), and a newline (“\n”) follows the species-level information. I also tried other separators, such as four spaces or a single space. Sadly, when I run classify.seqs() on my own data, I always get the same error message: “‘408877218’ is in your template file and is not in your taxonomy file. Please correct.”, once for every sequence in my template file. I checked whether something was wrong with the names, but every header of the .fasta file exists in the .taxonomy file. I suspect that something is still wrong with the formatting of the files but can’t figure out what it is. I would be glad if someone could help! Thanks a lot!
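
One way to double-check that every header really has an exact counterpart in the taxonomy file is a byte-for-byte comparison, which also catches invisible differences such as Windows line endings or a missing tab. The following is only a sketch under assumptions: the function names are hypothetical, the name is taken to be everything after “>” on the header line, and the taxonomy file is taken to be tab-separated as described above. It does not reproduce how mothur itself parses the files.

```shell
# Print FASTA header names that have no exact match in the first
# tab-separated column of the taxonomy file.
# Arguments: $1 = fasta file, $2 = taxonomy file
missing_in_taxonomy() {
    awk -F'\t' '
        NR == FNR { seen[$1] = 1; next }   # first file: remember taxonomy names
        /^>/ {
            name = substr($0, 2)           # drop the leading ">"
            if (!(name in seen)) print name
        }
    ' "$2" "$1"
}

# Count lines ending in a carriage return (a sign of Windows line endings).
# Argument: $1 = file to inspect
count_crlf_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}
```

If missing_in_taxonomy prints nothing and count_crlf_lines reports 0 for both files, the names match exactly under these assumptions and the cause probably lies elsewhere.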

Kendra · April 12, 2017, 4:02pm

Here’s my code for creating a taxonomy file from the UNITE database:

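#create taxonomy file from Unite database
# 1. keep only the FASTA header lines
gawk '/^>/{print $0}' UNITE_public_24.09.12.fasta >test.txt
# 2. split on "|" and print fields 1 and 5 (sequence name and taxonomy string), tab-separated
gawk 'BEGIN { FS="|"; OFS="\t" } {print $1, $5}' test.txt >test1.txt
# 3. strip the ">" from the names
sed 's/>//g' test1.txt >test2.txt
# 4. remove all spaces
sed 's/ //g' test2.txt >UNITE.tax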

I’ve never made one from an NCBI DB, so I’m not sure what you’d need to adjust.
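
The four steps above could also be collapsed into a single pass, which avoids the intermediate files. This is a sketch, not a drop-in replacement: the function name is hypothetical, plain awk is used instead of gawk, and it assumes the same header layout as the commands above (“|”-separated fields with the taxonomy in field 5). Unlike the sed step, it removes only the leading “>” of the name.

```shell
# Turn UNITE-style FASTA headers into "name<TAB>taxonomy" lines in one pass.
# Argument: $1 = fasta file whose headers have "|"-separated fields
unite_headers_to_tax() {
    awk -F'|' -v OFS='\t' '
        /^>/ {
            name = substr($1, 2)   # field 1 without the leading ">"
            tax = $5               # field 5, assumed to hold the taxonomy
            gsub(/ /, "", name)    # remove spaces, as the last sed step does
            gsub(/ /, "", tax)
            print name, tax
        }
    ' "$1"
}
```

For an NCBI-derived file, the field separator and field numbers would have to be adapted to whatever the headers actually look like.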

RobinRohwer · April 24, 2017, 9:37pm

I had a similar error once when using long numeric strings as sequence identifiers. At some point in a processing script they got imported as numeric and changed into scientific notation, which gave me a similar names-don’t-match-between-taxonomy-and-fasta error in mothur. You could try adding a letter at the beginning of your identifiers to rule that out (note in my case the bug was introduced in my own script, not in mothur).
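
To rule this out, one could scan the first column of the taxonomy file for entries that look like scientific notation and, if wanted, add a letter prefix to the identifiers in both files. A minimal sketch, assuming hypothetical function names, a default prefix of “s” chosen only for illustration, and names in the first whitespace-separated column:

```shell
# List first-column entries that look like scientific notation, e.g. 1.16908e+09.
# Argument: $1 = taxonomy file
find_sci_notation_ids() {
    awk '$1 ~ /^[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+$/ { print $1 }' "$1"
}

# Add a letter prefix to every FASTA header name.
# Arguments: $1 = fasta file, $2 = prefix (optional, defaults to "s")
prefix_fasta_ids() {
    sed "s/^>/>${2:-s}/" "$1"
}

# Add the same prefix to the start of every non-empty taxonomy line.
# Arguments: $1 = taxonomy file, $2 = prefix (optional, defaults to "s")
prefix_tax_ids() {
    sed "/./s/^/${2:-s}/" "$1"
}
```

Both files need the same prefix, otherwise the names will no longer match each other.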
