creating a .taxonomy file for a customized database

Hello everyone,
I have been using mothur (v.1.38.1) for some time now and want to create a .taxonomy and a .fasta file that can be used with the classify.seqs() command. As a template, I downloaded an NCBI database, which I got here: https://ncbiinsights.ncbi.nlm.nih.gov/2015/05/11/accessing-the-hidden-kingdom-fungal-its-reference-sequences-2/. Since I am working with fungal ITS sequences, I have previously done a taxonomic assignment using mothur and the UNITE database in a suitable format. I have tried to re-create this format as described here: https://www.mothur.org/wiki/Taxonomy_outline. All the sequences from the NCBI dataset are now in a fasta file that has the following format:

“UNIQUEHEADER1”
“THESEQUENCE1”
“UNIQUEHEADER2”
“THESEQUENCE2”

As the unique header I used GI numbers, so each header contains only the leading “>” followed by a unique number.
I also created the .taxonomy file by extracting the taxonomy from the NCBI files; the format looks like this:

1169078893 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Microascales;f__Microascaceae;g__Wardomycopsis;s__litoralis;
1169078892 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Microascales;f__Microascaceae;g__Wardomyces;s__ovalis;

Here, the leading number is the GI number of the corresponding sequence.
The GI number is separated from the taxonomy by a tab ("\t"), and a newline ("\n") follows the species-level information. I also tried other separators, such as four spaces or a single space. Sadly, when I use the classify.seqs() command with my own data, I always get the same error message: “‘408877218’ is in your template file and is not in your taxonomy file. Please correct.”, and this is reported for every sequence in my template file. I checked whether something was wrong with the names, but every header of the .fasta file exists in the .taxonomy file. I reckon that something is still wrong with the formatting of the files, but I can’t figure out what it is. I would be glad if someone could help! Thanks a lot!
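One way to narrow this down is to compare the IDs of the two files byte for byte instead of by eye. The following is only a minimal Python sketch, not part of mothur: the function names are made up for illustration, and it assumes the taxonomy file is tab-separated as described above. It prints IDs with repr(), so hidden characters such as a trailing "\r" from Windows line endings, or a space used where a tab was intended, become visible.

```python
def fasta_ids(lines):
    """Return the ID of each fasta header: the text after '>' up to the first space or tab."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            # Split on space/tab only, so a stray "\r" stays part of the ID and shows up.
            ids.append(header.replace("\t", " ").split(" ")[0])
    return ids


def taxonomy_ids(lines):
    """Return the first tab-separated column of each non-empty taxonomy line."""
    return [line.rstrip("\n").split("\t")[0] for line in lines if line.strip()]


def compare_ids(fasta_lines, taxonomy_lines):
    """List IDs that occur in only one of the two files, as repr() strings."""
    f_ids = fasta_ids(fasta_lines)
    t_ids = taxonomy_ids(taxonomy_lines)
    f_set, t_set = set(f_ids), set(t_ids)
    return {
        "only_in_fasta": [repr(i) for i in f_ids if i not in t_set],
        "only_in_taxonomy": [repr(i) for i in t_ids if i not in f_set],
    }


def read_lines(path):
    """Read a file without newline translation."""
    # newline="" keeps "\r\n" endings intact instead of converting them to "\n".
    with open(path, newline="") as handle:
        return handle.readlines()
```

With hypothetical file names, compare_ids(read_lines("template.fasta"), read_lines("template.taxonomy")) returns two lists. If both are empty, the IDs really do match and the cause lies elsewhere; if an ID shows up as something like '408877218\r', or a whole taxonomy line appears as a single ID, the separator or the line endings are the likely culprit.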

Here’s my code for creating a taxonomy file from the UNITE database:

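# Create a taxonomy file from the UNITE database
# 1. Keep only the fasta header lines
gawk '/^>/' UNITE_public_24.09.12.fasta > test.txt
# 2. Split each header on "|" and print fields 1 and 5, separated by a tab
gawk 'BEGIN { FS="|"; OFS="\t" } { print $1, $5 }' test.txt > test1.txt
# 3. Remove the ">" header marker
sed 's/>//g' test1.txt > test2.txt
# 4. Remove all spaces
sed 's/ //g' test2.txt > UNITE.tax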

I’ve never made one from an NCBI database, so I’m not sure what you’d need to adjust.
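For anyone who prefers a single script, the gawk/sed steps above can be sketched in Python as follows. This is an approximate equivalent, not a tested replacement: the function names are hypothetical, and it assumes, as the gawk command implies, that each header has at least five "|"-separated fields. Unlike the gawk version, it skips headers with fewer fields instead of printing an empty taxonomy.

```python
def unite_header_to_tax_line(header):
    """Turn one '|'-separated fasta header into an 'ID<tab>taxonomy' line."""
    fields = header.rstrip("\r\n").split("|")
    if len(fields) < 5:
        # Assumption: headers without a fifth field carry no taxonomy, so skip them.
        return None
    line = fields[0] + "\t" + fields[4]
    # Same clean-up as the two sed calls: drop every ">" and every space.
    return line.replace(">", "").replace(" ", "")


def taxonomy_lines(fasta_lines):
    """Build taxonomy lines from all header lines of a fasta file."""
    result = []
    for line in fasta_lines:
        if line.startswith(">"):
            tax_line = unite_header_to_tax_line(line)
            if tax_line is not None:
                result.append(tax_line)
    return result
```

The output lines can be written to a file with "\n" as the line ending. For an NCBI download the field positions will differ, so the indices 0 and 4 would have to be adapted to that header layout.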

I had a similar error once when using long numeric strings as sequence identifiers. At some point in a processing script they got imported as numeric and changed into scientific notation, which gave me a similar names-don’t-match-between-taxonomy-and-fasta error in mothur. You could try adding a letter at the beginning of your identifiers to rule that out (note in my case the bug was introduced in my own script, not in mothur).
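The check and the workaround described here could look roughly like this in Python. It is a sketch with hypothetical function names and an arbitrary prefix letter "S": the first function flags identifiers that look as if they were converted to scientific notation, and the other two add the same letter to the fasta headers and to the first column of the taxonomy file so that the IDs in both files stay in sync.

```python
import re

# Matches strings such as "1.16908e+09" or "4E8", i.e. numbers in scientific notation.
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?[eE][+-]?\d+$")


def looks_like_sci_notation(identifier):
    """Return True if an identifier looks like a number in scientific notation."""
    return bool(SCI_NOTATION.match(identifier))


def prefix_fasta_headers(lines, prefix="S"):
    """Insert the prefix right after '>' in every header line; leave sequences alone."""
    return [">" + prefix + line[1:] if line.startswith(">") else line for line in lines]


def prefix_taxonomy_ids(lines, prefix="S"):
    """Put the prefix in front of every non-empty taxonomy line (the ID is the first column)."""
    return [prefix + line if line.strip() else line for line in lines]
```

Both files must get the same prefix, otherwise the names will not match afterwards either.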