We would like to use an updated version of the SILVA taxonomy for classifying OTUs with the classify.seqs command. One of our tech-savvy students created a taxonomy file and a template file from the current SILVA SSU Ref NR 111 trunc FASTA file. As far as I can see, they follow the rules for those files and look like the ones I downloaded from the mothur website (nogap.eukarya.fasta & silva.eukarya.silva.taxonomy). However, when I run the command, I get:
Reading in the SSURef111.taxonomy taxonomy... Done.
Generating search database...
After a while this is followed by a long list of error messages scrolling down the screen, for example:
Z99951.1.1760 is in your taxonomy file and is not in your template file. Please correct.
Done.
It took 212 seconds generate search database.
It took 223 seconds get probabilities.
I suppose the error is given for every sequence in the new SILVA files we created. Of course, I checked some of the sequences that are supposed to be missing, e.g. Z99951.1.1760, and they are present in both the taxonomy and the template file. Here is how they look in the files:
When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.
Distributed under the GNU General Public License
Type 'help()' for information on the commands that are available
Type 'quit()' to exit program
Interactive Mode
Reading in the SSURef111.taxonomy taxonomy... DONE.
Generating search database... DONE.
It took 147 seconds generate search database.
Calculating template taxonomy tree... DONE.
Calculating template probabilities... [ERROR]: std::bad_alloc has occurred in the Bayesian class function Bayesian. Please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.
Edit: Using the k-Nearest Neighbor algorithm with the classify.seqs command does work with my file and the new SILVA database, the problem seems to be in the Bayesian algorithm.
The error message you are getting indicates you are running out of memory. Is the new template bigger? You might try running it with processors=1. The more processors you use the more memory is required.
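To get a feel for why a bigger template can exhaust memory, here is a rough back-of-envelope sketch in Python. It rests on assumptions that are not confirmed in this thread: that the Bayesian classifier keeps one 4-byte value per (k-mer, taxon) pair, and that the k-mer size is 8. The function name and the example taxon counts are hypothetical.

```python
def estimate_probability_table_bytes(num_taxa, kmer_size=8, bytes_per_value=4):
    """Rough size of a k-mer-by-taxon probability table.

    ASSUMPTION: one value of `bytes_per_value` bytes is stored for every
    combination of possible k-mer (4 ** kmer_size) and taxon. This is an
    illustrative model, not a description of mothur's actual internals.
    """
    num_kmers = 4 ** kmer_size  # number of possible DNA words of length k
    return num_kmers * num_taxa * bytes_per_value


if __name__ == "__main__":
    # Hypothetical taxon counts for a small and a large reference taxonomy.
    for num_taxa in (2_000, 10_000):
        gib = estimate_probability_table_bytes(num_taxa) / 2**30
        print(f"{num_taxa} taxa: about {gib:.2f} GiB")
```

Under this model the table grows linearly with the number of taxa in the taxonomy file, so a reference with several times more taxa needs several times more memory before any per-processor overhead is counted.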
I am also very much interested in using a more recent version of the silva reference files to replace the current v102 with v111.
Are there plans to update the files on http://www.mothur.org/wiki/Silva_reference_files?
If not: is there a "protocol"/SOP available on how to transform the files from the ARB files on http://www.arb-silva.de/download/arb-files/ to a more mothur-friendly format?
Also: if any of the other mothur users already did this, would you be so kind to share it with me?
To make this clear, I would like to make a new silva v111 reference alignment myself, if I knew exactly how to do so.
From http://www.mothur.org/wiki/Silva_reference_files I understand that the flowchart is:
1) Identify unique sequences in the SSURef
2) Use SINA aligner on these unique sequences
3) Look at alignment report and select only 100% quality score sequences that go from 8f/27f to 1492r.
Is this correct? If so, I have a few practical questions about the individual steps:
1) Would it be better to use SSURef NR? Is this appropriate for selecting the "unique" sequences? Or is 98% identity an issue for downstream processing in mothur (e.g. when using label "unique" or 99)? Which is the best file to start from in http://www.arb-silva.de/no_cache/download/archive/release_111/Exports/?
2) No questions here.
3) The 100% quality score filter seems fairly easy to apply as a criterion; however, the 8f/27f-1492r criterion seems a bit more challenging. Would it be better to use the "truncated" files from http://www.arb-silva.de/no_cache/download/archive/release_111/Exports/? These seem to be truncated to the effective LSU or SSU genes using the termini filter in ARB, whatever the border of the effective LSU/SSU might be...
1) Identify unique sequences in the SSURef
2) Use SINA aligner on these unique sequences
3) Look at alignment report and select only 100% quality score sequences that go from 8f/27f to 1492r.
So if you do this, you will basically get the silva.bacteria.fasta file that has been posted. It really hasn't changed since the original because SILVA essentially has "proprietary" sequences that investigators submit and align poorly.
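For anyone who wants to try the first step of that flowchart (identifying unique sequences) outside mothur, here is a minimal Python sketch. It only collapses exact duplicates, which is what mothur's own unique.seqs command does; the function names are hypothetical, and treating sequences case-insensitively is an assumption of this sketch.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The identifier is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            seq_id, chunks = (tokens[0] if tokens else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def dereplicate(records):
    """Collapse exact duplicate sequences.

    Returns a list of (representative_id, sequence, all_ids) tuples in order
    of first appearance. ASSUMPTION: comparison ignores letter case only;
    no other normalisation (gaps, U vs T) is applied.
    """
    groups = {}
    for seq_id, seq in records:
        groups.setdefault(seq.upper(), []).append(seq_id)
    return [(ids[0], seq, ids) for seq, ids in groups.items()]


if __name__ == "__main__":
    example = [">a", "ACGU", ">b", "AC", "GU", ">c", "GGGG"]
    for rep, seq, ids in dereplicate(read_fasta(example)):
        print(rep, len(ids), seq)
```

The list of identifiers kept for each representative plays the same role as a mothur names file: it records which sequences were merged into which unique sequence.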
I am getting the same errors as described in this thread - mothur is not connecting the entries in the FASTA file with the entries in the tax file and tells me "xxxxxxxxxx.x.x is in your template file and is not in your taxonomy file. Please correct."
Here is a typical line from the "nogap" fasta file:
(identifier and taxonomy are separated by a tab - although that is difficult to see here).
Any suggestions? I have thus far been unable to discern any formatting differences between my custom files and the files "silva.eukarya.silva.tax" and "nogap.eukarya.fasta" from the silva reference set.
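When the two files look identical by eye, comparing the identifier sets programmatically can narrow things down. Below is a minimal Python sketch, assuming the FASTA identifier is the first whitespace-delimited token after '>' and the taxonomy identifier is everything before the first tab. The function names are hypothetical, and this is not how mothur itself parses the files.

```python
def fasta_ids(lines):
    """Return the set of identifiers found on FASTA header lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def taxonomy_ids(lines):
    """Return the set of identifiers in the first tab-separated column."""
    ids = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip():
            ids.add(line.split("\t", 1)[0].strip())
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (only_in_fasta, only_in_taxonomy) as sorted lists."""
    in_fasta = fasta_ids(fasta_lines)
    in_tax = taxonomy_ids(taxonomy_lines)
    return sorted(in_fasta - in_tax), sorted(in_tax - in_fasta)


if __name__ == "__main__":
    fasta = [">A.1.10", "ACGU", ">B.1.20 some description", "GGCC"]
    taxonomy = ["A.1.10\tBacteria;Firmicutes;", "C.1.5\tArchaea;"]
    only_fasta, only_tax = compare_ids(fasta, taxonomy)
    print("only in FASTA:", only_fasta)
    print("only in taxonomy:", only_tax)
```

If both lists come back empty while mothur still complains, the identifiers themselves agree and the problem is more likely in how the rest of each taxonomy line is formatted.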
I suspect you might have spaces in your taxonomy strings. Typically if you go up a line or two from where you're getting the error in the taxonomy file you'll find a sequence with a space. You'll want to change those to "_"
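A minimal Python sketch of that fix follows, assuming each taxonomy line is an identifier, a tab, and then the taxonomy string (as described earlier in the thread). The function names are hypothetical. Runs of whitespace inside the taxonomy string are collapsed into a single underscore, which is a choice made for this sketch.

```python
def has_space_in_taxonomy(line):
    """True if the taxonomy string (after the first tab) contains whitespace."""
    parts = line.rstrip("\r\n").split("\t", 1)
    return len(parts) == 2 and any(ch.isspace() for ch in parts[1].strip())


def fix_taxonomy_line(line):
    """Replace whitespace in the taxonomy string with underscores.

    Lines without a tab are returned unchanged (minus the line ending),
    because the identifier and taxonomy cannot be told apart reliably.
    """
    line = line.rstrip("\r\n")
    if "\t" not in line:
        return line
    seq_id, taxonomy = line.split("\t", 1)
    return seq_id + "\t" + "_".join(taxonomy.split())


if __name__ == "__main__":
    example = [
        "A.1.10\tBacteria;Candidate division OP10;",
        "B.1.20\tBacteria;Firmicutes;",
    ]
    for line in example:
        flag = "FIXED" if has_space_in_taxonomy(line) else "ok"
        print(flag, fix_taxonomy_line(line), sep="\t")
```

Running the check first and looking at the flagged lines is a quick way to confirm that spaces really are the cause before rewriting the whole file.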
Hi jpiaskowski, are you using an updated SILVA database? If so, is it possible to get it somewhere, or is it one you built for your own work?
Or Pat, would it be possible to have the latest release (119) of the SILVA database formatted for mothur, including the LSU sequences and the archaea sequences, to analyze organisms other than bacteria?
Thanks!
I would be happy to upload my custom database files as an attachment, but I'm not certain how to do that. I visited a few help pages on mothur and found out how to manage existing files, but I have no idea how to add new ones. These are essentially text files (.fasta, .tax). Thanks.
Nice to know, Pat, that you will release a newly formatted version of the SILVA database. Will it be a full version, with different files according to sequence type, or only bacterial sequences? Thanks!