classify.seqs errors in reading taxonomy file

Hey guys, in my previous post (classify.seqs error: xxx is in temp file but not in tax file), I created my own tax file and template file for fungal 28S rRNA based on the Silva LSURef 115 release, and I got lots of errors like “‘CP002767.4’ is in your template file and is not in your taxonomy file. Please correct.” Pat suggested that I confirm the template file has exactly twice as many lines as the tax file, and I made a change to the original Silva fasta file afterwards. However, there is one thing I didn’t mention in the last post because I thought I could figure it out by myself: errors about ‘missing the final ;’, something like these:

At first, I thought I could just add the final ‘;’ by discarding the species/strain name in the tax file, but the same errors kept showing up. So I readjusted the whitespace and trimmed the ACCN field to the form ‘XXnnnn.n’. That didn’t work either, and trimming the names was a mistake, as it resulted in ‘[ERROR]: FN650747.1 is already in your taxonomy file, names must be unique.’ The errors then became even weirder, e.g. ‘AF297515.1 is missing the final ;’ or ‘Eukaryota;Viridiplantae;Streptophyta;Embryophyta;Tracheophyta;Spermatophyta;Magnoliophyta;eudicotyledons;core is already in your taxonomy file, names must be unique.’ As far as I can tell, Eukaryota;Viridiplantae;Streptophyta;Embryophyta;Tracheophyta;Spermatophyta;Magnoliophyta;eudicotyledons;core should not be treated as a name at all.

Generally speaking, classify.seqs seems to be reading my tax file in an abnormal way, even though I can’t see any difference in data format (whitespace, encoding, line breaks, etc.) between my customised tax file and the standard Silva 18S rRNA tax file downloaded from the mothur website.
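
A format claim like this can be tested with a quick lint pass over the tax file. The following is a minimal Python sketch, not part of mothur: the function name `check_taxonomy_lines` is made up for illustration, and it assumes that every line should hold exactly two whitespace-separated fields (sequence name, then taxonomy), that names must be unique, and that the taxonomy must end in ‘;’.

```python
from collections import Counter


def check_taxonomy_lines(lines):
    """Return (line_number, message) pairs for lines that look malformed.

    Assumed rules (taken from the error messages in this thread):
    two whitespace-separated fields per line, unique sequence names,
    and a taxonomy string that ends with ';'.
    """
    problems = []
    seen = Counter()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 whitespace-separated fields, found {len(fields)}")
            )
            continue
        name, taxonomy = fields
        seen[name] += 1
        if seen[name] > 1:
            problems.append((number, f"duplicate name {name}"))
        if not taxonomy.endswith(";"):
            problems.append((number, "taxonomy is missing the final ';'"))
    return problems
```

Running it on the lines of a tax file (for example `check_taxonomy_lines(open(path))`, where `path` is wherever the file lives) lists the line numbers worth inspecting by eye.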

Here is a Google Drive link to a folder containing the tax file I created, the 18S rRNA tax file, the LSURef 115 fasta file I used, the mothur logfile, and the Matlab scripts I used to generate my tax file.

Any ideas would be welcome. Thanks.

Here’s the problem - some of your taxonomy strings have spaces in them…

AATN01000020.2 Bacteria;Proteobacteria;Betaproteobacteria;Rhodocyclales;Rhodocyclaceae;Candidatus Accumulibacter;

Try making those spaces into underscores…

AATN01000020.2 Bacteria;Proteobacteria;Betaproteobacteria;Rhodocyclales;Rhodocyclaceae;Candidatus_Accumulibacter;

We read the tax file expecting only whitespace between the sequence name and its taxonomy. So if you have a space inside the taxonomy, like you do here, mothur thinks Accumulibacter; is a sequence name. Give that a shot…
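
To make the effect concrete, here is a small Python sketch. It is not mothur's actual reader: it assumes the file is consumed as alternating whitespace-separated tokens (name, taxonomy, name, taxonomy, …), which is one reading of the explanation above. The function names `read_pairs` and `fix_line` and the second record (`SEQ_B`) are made up for illustration.

```python
def read_pairs(text):
    """Pair up whitespace-separated tokens as (name, taxonomy).

    Assumption: this mimics a reader that only expects whitespace
    between a sequence name and its taxonomy.
    """
    tokens = text.split()
    return list(zip(tokens[0::2], tokens[1::2]))


def fix_line(line):
    """Keep the first field as the name and join the remaining fields
    into a single taxonomy string, using '_' in place of spaces."""
    fields = line.split()
    if len(fields) < 2:
        return line.strip()  # nothing to repair
    return fields[0] + "\t" + "_".join(fields[1:])


broken = (
    "AATN01000020.2\tBacteria;Proteobacteria;Betaproteobacteria;"
    "Rhodocyclales;Rhodocyclaceae;Candidatus Accumulibacter;\n"
    "SEQ_B\tEukaryota;Fungi;\n"  # hypothetical second record
)

# With the space in place, 'Accumulibacter;' is read as a sequence name
# and the next record's name is read as its taxonomy.
for name, taxonomy in read_pairs(broken):
    print(name, "->", taxonomy)

# After the fix, every record pairs up as intended.
fixed = "\n".join(fix_line(line) for line in broken.splitlines())
for name, taxonomy in read_pairs(fixed):
    print(name, "->", taxonomy)
```

Under that assumption, one stray space shifts every later token by one position, which would explain why accession numbers get reported as taxonomies missing the final ‘;’ and why taxonomy fragments get reported as duplicate names. Note that `fix_line` would also turn a space after a ‘;’ into an underscore, so it is worth checking the output before using it.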