Using new SILVA taxonomy file with classify.seqs

Hello,

we would like to use an updated version of the SILVA taxonomy for classifying OTUs with the classify.seqs command. One of our tech-savvy students created a taxonomy and a template file from the current SILVA SSU Ref NR 111 trunc FASTA file. As far as I can see, they follow the rules for those files and look like the ones I downloaded from the mothur website (nogap.eukarya.fasta & silva.eukarya.silva.taxonomy). However, when I run the command

mothur > classify.seqs(fasta=KRAEK1final.fasta, template=SSURef111.fasta, taxonomy=SSURef111.taxonomy)

I get the output:

Reading in the SSURef111.taxonomy taxonomy
Done.
Generating search database


This is followed after a while by a long list of error messages scrolling down the screen, saying (for example):

Z99951.1.1760 is in your taxonomy file and is not in your template file. Please correct.
Done.
It took 212 seconds generate search database.
It took 223 seconds get probabilities.

I suppose the error is given for every sequence in the new SILVA files we created. Of course, I checked some of the sequences that are supposed to be missing, e.g. Z99951.1.1760, and they are present in both the taxonomy and the template file. Here is how they look in the files:

From the taxonomy file:

Z99951.1.1760 Eukaryota;Opisthokonta;Metazoa;Platyhelminthes;Turbellaria;Seriata;Romankenkius_libidinosus;

From the template file:

Z99951.1.1760
TAGTCATATGCTTATCTCAAAGATTAAGCCATGCATGTCTAAGTACAGAGATTTATATTCTAAAACCGCG
GATGGCTCATTATAACAGCTATGATTTGAGAGAACTAATCTTTTTGCTACAAGATAACTGTGGTAATTCT
AGAGCTAATATTTACAAGAATGCCGTGACTAACGAAGCGGCGGATTTATTAGATCAAAATCAACCAGGCA
CGCAAGTGTCGGTATATTGATGAATCTGGATAACTTTACTGATCAAACGACCTTGTGTCGTTGACGAATC
TCTTGAAATGGCTGACCTATCAACTTTCGATGGTAAGATCAAAGCTTACCATGGTTGTAACGGGTAACGG
GGAATCAGTGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCACGGAAGGCAGCAGGCGC
GTAAATTACCCAATACCGGCTCGGTGAGGTAGTGACAAAAAATAACAATATGGGCCCTAGTGGTTTCATA
ATTGCAATGAGAAAATTTTAAATACTTTATCAAGTATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGC
GGTAATTCCAGCTCCAAAAGCGTATATTAAAGTTGTTGCAGTTAAAACGCTCGTAGTTGAAATTGGGGAC
TTGCGACTAGTTGAAACCTATGGTTGATATTGGTTGTTTCCTTCGTCGTCGTGTATATCGCTGATGTTCT
TTAATGGATGTCGTCAATAACCGACAAGTTTACTTTGAAAAAATTAGAGTGCTTAAAGCAGGCTTACGCT
TGTATATTGTTGCATGGAATAATGAAATAGGACTTCGGTTTTATTTTGTTGGTTTTCGAAACTGAAGTAA
TGATTAAAAGAGACTGCCGGGGGCATATGTATGCTGGCGTTAGAGGTGAAATTCTTAGATCGTCAGCAGA
CAAACTACTGCGAAAGCATTTGCCAAGAATGTTTTCATTAATCAAGAACGAAAGTCAGAGGATCGAAGAC
GATCAGATACCGTCCTAGTTCTGACCGTAAACTATGCCAACTGACAGTTAGCATAAGGTAATTCAAATCT
CCTTTCTAGAAGTCACCGGGAAACCTAAGTCTATGGGTTCCGGGGGAAGTATGGTTGCAAAGCTGAAACT
TAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAATC
TCACCCGGTCCGGACACTGTGAGGATTGACAGATTGATAGCTCTTTCATGATTTGGTGGGTGGTGGTGCA
TGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGATAACGAACGAGACTCTGACCTGCTAA
ATAGTAAATTTGTGTTTAATCGCAAATCTTTACTTCTTAGAGGAATAAATAGCGTTTAGCTAAATGAAAT
GGAGCAATAACAGGTCTGTGATGCCCTTAGATGTCCGGGGCCGCACGCGCGCTACAATGGCGGTAACAAC
AAGTTTGTCCTGGCTAGAAATGGTTGGGTAATCTTGTGAATCACCGTCGTGTCTGGAATAGTGGATAGCA
ATTTTCCCACTTGAACGAGGAATTCCTAGTAAGCGCAAGTCATCAACTTGCGCTGATTACGTCCCTGCCC
TTTGTACACACCGCCCGTCGCTACTACCGATTGAATGGTTTAGTGAGATTGTTGGATTCTGCACTAAGAA
ATGGCAACATTTCTATGAATGGGAAAAGACACTCAAACTTGATCATTTAGAGGAAGTAAAAGTCGTAACA
AGGTGTCCGT

When I run the command on my file with the SILVA databases from mothur, I get proper output and classification of my sequences.

mothur > classify.seqs(fasta=KRAEK1final.fasta, template=nogap.eukarya.fasta, taxonomy=silva.eukarya.silva.taxonomy)

Anyone able to help and spot the error we made?

Thank you in advance,

René

Try removing the “>” characters from your taxonomy file.
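
For anyone hitting the same error, here is a minimal sketch of that fix in Python. It assumes the taxonomy file is plain text with one sequence per line and that the stray “>” sits at the very start of each line. The function name `strip_leading_gt` is made up for illustration and is not part of mothur.

```python
def strip_leading_gt(taxonomy_lines):
    """Remove a leading '>' from the sequence name on each taxonomy line.

    Assumption: the '>' only ever appears as the first character of a line.
    The lineage string after the name is left untouched.
    """
    cleaned = []
    for line in taxonomy_lines:
        cleaned.append(line[1:] if line.startswith(">") else line)
    return cleaned


if __name__ == "__main__":
    # One line with a stray '>' and one line that is already fine.
    example = [
        ">Z99951.1.1760\tEukaryota;Opisthokonta;Metazoa;Platyhelminthes;"
        "Turbellaria;Seriata;Romankenkius_libidinosus;\n",
        "AB015712.1.587\tEukaryota;Opisthokonta;Nucletmycea;Fungi;\n",
    ]
    for line in strip_leading_gt(example):
        print(line, end="")
```

To apply it to a real file, read the lines with `open(path).readlines()`, pass them through the function, and write the result to a new file so the original stays available for comparison.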

Thank you Pat! That did the trick and brought me one step further. I hoped that it would be something simple like this that I just overlooked.

However, I have now run into another error message that explicitly asked me to contact you. :) Here is the log file:

Windows version

Running 64Bit Version

mothur v.1.27.0
Last updated: 8/8/2012

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type ‘help()’ for information on the commands that are available

Type ‘quit()’ to exit program
Interactive Mode


mothur > classify.seqs(fasta=KRAEK1final.fasta, template=SSURef111.fasta, taxonomy=SSURef111.taxonomy, processors=2)

Using 2 processors.

Reading in the SSURef111.taxonomy taxonomy
 DONE.
Generating search database
 DONE.
It took 147 seconds generate search database.
Calculating template taxonomy tree
 DONE.
Calculating template probabilities
 [ERROR]: std::bad_alloc has occurred in the Bayesian class function Bayesian. Please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.

Edit: Using the k-Nearest Neighbor algorithm with the classify.seqs command does work with my file and the new SILVA database, so the problem seems to be in the Bayesian algorithm.

The error message you are getting indicates you are running out of memory. Is the new template bigger? You might try running it with processors=1. The more processors you use, the more memory is required.

Hi,

I am also very much interested in using a more recent version of the SILVA reference files, replacing the current v102 with v111.
Are there plans to update the files on http://www.mothur.org/wiki/Silva_reference_files?
If not, is there a ‘protocol’/SOP available on how to convert the ARB files on http://www.arb-silva.de/download/arb-files/ to a more mothur-friendly format?
Also, if any other mothur users have already done this, would you be so kind as to share it with me?

Kind regards,

Frederiek - Maarten Kerckhof

To make this clear, I would like to make a new silva v111 reference alignment myself, if I knew exactly how to do so.
From http://www.mothur.org/wiki/Silva_reference_files I understand that the flowchart is:

  1. Identify unique sequences in the SSURef
  2. Use SINA aligner on these unique sequences
  3. Look at alignment report and select only 100% quality score sequences that go from 8f/27f to 1492r.

Is this correct? If so, I have a few practical questions on the different steps:

  1. Would it be better to use SSURef NR? Is this appropriate for selecting the ‘unique’ sequences? Or is 98% identity an issue for downstream processing in mothur (e.g. when using label ‘unique’ or 99)? Which is the best file to start from in http://www.arb-silva.de/no_cache/download/archive/release_111/Exports/?
  2. No questions here.
  3. The 100% quality score filtering seems fairly easy to apply as a criterion; however, the 8f/27f-1492r criterion seems a bit more challenging. Would it be better to use the ‘truncated’ files from http://www.arb-silva.de/no_cache/download/archive/release_111/Exports/? These seem to be truncated to the effective LSU or SSU genes using the termini filter in ARB, whatever the border of the effective LSU/SSU might be.


Thank you very much for any input.

Kind regards,

FM Kerckhof

  1. Identify unique sequences in the SSURef
  2. Use SINA aligner on these unique sequences
  3. Look at alignment report and select only 100% quality score sequences that go from 8f/27f to 1492r.

So if you do this, you will basically get the silva.bacteria.fasta file that has been posted. It really hasn’t changed since the original because SILVA essentially has “proprietary” sequences that investigators submit and align poorly.
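
For readers who still want to try the first step of that flowchart themselves, here is a rough Python sketch of collapsing a FASTA file to its unique sequences. It assumes "unique" means exact string identity after upper-casing, which is not the same as the identity clustering behind SSURef NR. The helper names `read_fasta` and `unique_sequences` are hypothetical, and mothur's own unique.seqs command is the more usual route.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA lines; sequences may be wrapped."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def unique_sequences(records):
    """Keep the first name seen for each distinct (upper-cased) sequence."""
    first_name = {}
    for name, seq in records:
        # setdefault only stores the name the first time a sequence is seen.
        first_name.setdefault(seq.upper(), name)
    # Dicts preserve insertion order, so output follows first appearance.
    return [(name, seq) for seq, name in first_name.items()]


if __name__ == "__main__":
    # Toy input: 'b' duplicates 'a' apart from letter case.
    toy = [">a\n", "AC\n", "GT\n", ">b\n", "acgt\n", ">c\n", "TTTT\n"]
    for name, seq in unique_sequences(read_fasta(toy)):
        print(f">{name}\n{seq}")
```

Note that this keeps only one representative name per sequence and discards the others, so the taxonomy file would need to be reduced to the same set of names afterwards.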

Thank you for the clarification, I was not aware that the alignment stayed the same since v102.

Kind regards,

FM

I am getting the same errors as described in this thread: mothur is not connecting the entries in the FASTA file with the entries in the tax file and tells me “xxxxxxxxxx.x.x is in your template file and is not in your taxonomy file. Please correct.”

Here is a typical line from the “nogap” fasta file:

AB015712.1.587
GTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATAAGCAAGTATACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTATTTGATAGTGCCTTACTACTTGGATAACCGTGGTAATTCTAGAGCTAATACATGCAAAAAATCCCGACTTCGGGAAGGGAAGTATTTATTAGATACAAAACCAATCTCGTCTTTGGGCGAGTTCCTTGGTGATTCATGGTAACTTTTCGAATCGCATGGCCTTGCGCCGGCGATGGTTCATTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGAGGCCTACCATGGTTTTAACGGGTAACGGGGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCCGACACGGGGAGGTAGTGACAATAAATAACAATACAGGGCCCTCACGGGTCTTGTAATTGGAATGAGTACAATTTAAATCTCTTAACGAGGAACAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATT

Here are two lines from the .tax file:

AB015712.1.587 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota;Archaeosporales;Ambispora;Ambispora_leptoticha;
AB015711.1.587 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota;Archaeosporales;Ambispora;Ambispora_leptoticha;

(identifier and taxonomy are separated by a tab - although that is difficult to see here).

Any suggestions? I have thus far been unable to discern any formatting differences between my custom files and the files “silva.eukarya.silva.tax” and “nogap.eukarya.fasta” from the silva reference set.

Thanks in advance.

I suspect you might have spaces in your taxonomy strings. Typically if you go up a line or two from where you’re getting the error in the taxonomy file you’ll find a sequence with a space. You’ll want to change those to “_”

Pat
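
A small Python sketch of that check, for anyone who would rather not hunt through the file by hand. It assumes the sequence name and the lineage are separated by a single tab, as described earlier in the thread. The function names `lines_with_spaces` and `fix_taxonomy_spaces` are invented for this example.

```python
def lines_with_spaces(lines):
    """Yield (line_number, line) for taxonomy lines whose lineage contains a space."""
    for number, line in enumerate(lines, start=1):
        _, sep, lineage = line.rstrip("\r\n").partition("\t")
        # Leading/trailing blanks are ignored here; only inner spaces are reported.
        if sep and " " in lineage.strip():
            yield number, line.rstrip("\r\n")


def fix_taxonomy_spaces(line):
    """Return the line with spaces inside the lineage replaced by underscores."""
    stripped = line.rstrip("\r\n")
    name, sep, lineage = stripped.partition("\t")
    if not sep:
        # No tab found: leave the line alone so it can be inspected by hand.
        return stripped
    return name + "\t" + lineage.strip().replace(" ", "_")


if __name__ == "__main__":
    # Made-up example lines; the second one has a space in the species name.
    example = [
        "seq1\tEukaryota;Opisthokonta;Ambispora_leptoticha;\n",
        "seq2\tEukaryota;Opisthokonta;Ambispora leptoticha;\n",
    ]
    for number, line in lines_with_spaces(example):
        print(number, repr(line))
        print(repr(fix_taxonomy_spaces(line)))
```

Printing with `repr` makes tabs and spaces visible, which is hard to judge by eye in most editors.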

I already took care of that in Notepad++ - there aren’t any spaces that I am aware of.
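
If no spaces turn up, another way to narrow it down is to compare the raw names in the two files directly. The Python sketch below is only a diagnostic idea, not something mothur provides. It assumes FASTA headers start with “>” and that the taxonomy name is everything before the first tab, and it compares the names as raw strings so that stray blanks or carriage returns show up as mismatches. The function names are hypothetical.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (those starting with '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            # Only the newline is removed, so trailing blanks or '\r' stay visible.
            names.add(line[1:].rstrip("\n"))
    return names


def taxonomy_names(lines):
    """Collect the text before the first tab on each non-empty taxonomy line."""
    names = set()
    for line in lines:
        if line.strip():
            names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def report_mismatches(fasta_lines, taxonomy_lines):
    """Return (names only in the FASTA file, names only in the taxonomy file)."""
    in_fasta = fasta_names(fasta_lines)
    in_tax = taxonomy_names(taxonomy_lines)
    return sorted(in_fasta - in_tax), sorted(in_tax - in_fasta)


if __name__ == "__main__":
    # Toy data: the taxonomy name carries an invisible trailing space.
    only_fasta, only_tax = report_mismatches(
        [">seq1\n", "ACGT\n"],
        ["seq1 \tEukaryota;Opisthokonta;\n"],
    )
    print("only in fasta:   ", [repr(n) for n in only_fasta])
    print("only in taxonomy:", [repr(n) for n in only_tax])
```

When reading real files on Windows, opening them with `open(path, newline="")` keeps any `\r` characters in the strings instead of silently translating them, which is what this comparison relies on to expose line-ending problems.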

Hi jpiaskowski, are you using an updated SILVA database? If so, is it possible to get it somewhere, or is it one you built for your own work?
Or Pat, would it be possible to have the latest release (119) of the SILVA database formatted for mothur, including the LSU sequences and the archaea sequences, to analyze organisms other than bacteria?
Thanks!

I’m working on an update but it won’t be available until next week.

If you could post somewhere the silva files you’ve generated I can take a look.

Pat

Hi,

I would be happy to upload my custom database files as an attachment, but I’m not certain how to do that. I visited a few help pages on mothur and found out how to manage existing files, but I have no idea how to add new files. These are essentially text files (.fasta, .tax). Thanks.

Nice to know, Pat, that you will release a newly formatted version of the SILVA database. Will it be a full version, with different files according to the type of sequence, or only bacterial seqs? Thanks!