Hello everyone,
I have been using mothur (v.1.38.1) for some time now and want to create a .taxonomy and a .fasta file that can be used as a template with the classify.seqs() command. As the source, I downloaded an NCBI database which I got here: Accessing the Hidden Kingdom: Fungal ITS Reference Sequences - NCBI Insights. Since I am working with fungal ITS sequences, I have previously done a taxonomic assignment using mothur and the UNITE database, which is available in a suitable format. I have tried to re-create this format as described here: Redirecting…. All the sequences from the NCBI dataset are now in a fasta file that has the following format:
“UNIQUEHEADER1”
“THESEQUENCE1”
“UNIQUEHEADER2”
“THESEQUENCE2”
…
As unique header, I used GI-Numbers, meaning the header only contains a unique number and the starting “>”.
I also created the .taxonomy file by extracting the taxonomy from the NCBI files. The format looks like this:
1169078893 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Microascales;f__Microascaceae;g__Wardomycopsis;s__litoralis;
1169078892 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Microascales;f__Microascaceae;g__Wardomyces;s__ovalis;
…
Here the leading number is the GI number of the corresponding sequence.
The GI number is separated from the taxonomy by a tab (“\t”), and the species-level information is followed by a newline character (“\n”). I also tried some other separators, such as four spaces or a single space. Sadly, when I try to use the classify.seqs() command to classify my own data, I always get the same error message: “‘408877218’ is in your template file and is not in your taxonomy file. Please correct.”, and I get it for every sequence in my template file. I checked whether something was wrong with the names, but every header of the .fasta file exists in the .taxonomy file. I reckon that something is still wrong with the formatting of the files, but I can’t figure out what it is. I would be glad if someone could help! Thanks a lot!
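For reference, here is a minimal sketch of the kind of name check described above, written in Python 3. It is not part of mothur: the file names `template.fasta` and `template.taxonomy` and all function names are hypothetical placeholders. It compares the IDs exactly as they are stored in the files and also lists IDs that carry invisible characters (for example a carriage return from Windows line endings or a trailing space), since such characters are easy to miss when comparing names by eye. Whether that is what causes the error here is only a guess.

```python
def fasta_ids(lines):
    """Return the IDs from FASTA header lines, keeping any hidden characters."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            # Strip only the newline so a stray "\r" or space stays visible.
            ids.append(line.rstrip("\n")[1:])
    return ids


def taxonomy_ids(lines):
    """Return the first tab-separated field of each non-empty taxonomy line."""
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # If there is no tab, the whole line is returned as the "ID".
            ids.append(line.split("\t", 1)[0])
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (missing, suspicious) lists of IDs.

    missing:    FASTA IDs with no exact match in the taxonomy file.
    suspicious: IDs from either file with surrounding whitespace or
                non-printable characters.
    """
    fasta = fasta_ids(fasta_lines)
    tax = set(taxonomy_ids(taxonomy_lines))
    missing = [i for i in fasta if i not in tax]
    suspicious = [
        i for i in fasta + sorted(tax)
        if i != i.strip() or not i.isprintable()
    ]
    return missing, suspicious


if __name__ == "__main__":
    # Hypothetical file names; newline="" keeps "\r" characters intact.
    with open("template.fasta", newline="") as f:
        fasta_lines = f.readlines()
    with open("template.taxonomy", newline="") as f:
        taxonomy_lines = f.readlines()

    missing, suspicious = compare_ids(fasta_lines, taxonomy_lines)
    print(f"{len(missing)} FASTA IDs without an exact taxonomy match")
    for seq_id in suspicious:
        # repr() makes characters such as "\r" visible.
        print("suspicious ID:", repr(seq_id))
```

If both lists come back empty, the names match byte for byte and the problem presumably lies elsewhere in the file format.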