Finding .taxonomy files for Classify.seqs?

Does anyone know how the SILVA .taxonomy file was obtained or created? I would like to look at using RDP or greengenes 16S aligned databases as templates for the classify.seqs function, but I haven’t been able to find the corresponding .taxonomy files.

Hi there azle694,
If you could give me a few days, I’ll post the analogous files for greengenes, NCBI, and RDP. Basically, what I did was to extract the NDS information corresponding to the SILVA taxonomy information from the SILVA-provided ARB database. I then selected only those sequences that were in the reference alignment (note: this is my attempt to reconstruct the SILVA SEED database). Finally, I converted any spaces in the taxonomy strings to underscores (“_”) and added a semicolon (“;”) to the end of each line.
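For anyone who wants to try the same thing on their own export, the last two steps could be sketched roughly like this in Python. This is only a minimal sketch, not the script that was actually used: it assumes the ARB NDS export has already been read into (sequence name, taxonomy string) pairs, and the function names `format_taxonomy_line` and `build_taxonomy` are made up for illustration.

```python
def format_taxonomy_line(name, taxonomy):
    """Return one 'name<TAB>taxonomy;' line for a .taxonomy file.

    Spaces inside the taxonomy string become underscores, and a
    trailing semicolon is added if the string does not already end in one.
    """
    cleaned = taxonomy.strip().replace(" ", "_")
    if not cleaned.endswith(";"):
        cleaned += ";"
    return f"{name}\t{cleaned}"


def build_taxonomy(records, reference_names):
    """Keep only records whose name is in the reference alignment.

    records: iterable of (name, taxonomy) pairs (assumed input shape).
    reference_names: names of the sequences in the reference alignment.
    """
    keep = set(reference_names)
    return [format_taxonomy_line(n, t) for n, t in records if n in keep]


if __name__ == "__main__":
    # Hypothetical example records, not real database entries.
    records = [
        ("seqA", "Bacteria;Firmicutes;Bacilli;Bacillus subtilis group"),
        ("seqB", "Archaea;Euryarchaeota"),
    ]
    for line in build_taxonomy(records, ["seqA"]):
        print(line)
```

Writing the returned lines to a file, one per line, should give something in the same tab-separated shape as the posted .taxonomy files, as long as the sequence names match those in the reference fasta.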

This database has just over 14,000 sequences in it. I think anyone should be able to construct a similar database for any gene family or to use a more complete database using this type of approach. The nice thing about the ARB database is that it has the fields for the different taxonomies. I think the greengenes dataset does as well. A big unresolved question is how big the database should be. Have fun exploring!


Hi - I have a similar question about your SILVA taxonomy. I’d like to directly compare classification results from different methods, so it would be very helpful if they’re on the same hierarchy. Is your RDP hierarchy mapping for the SILVA seed database based on the RDP Training Set 4, or the newer (early 2010) Training Set 5 or 6? (I hear the new RDP hierarchy reorganized some of Firmicutes and cyanobacteria, and I believe this is the default now on their website.) Thank you very much for your time and your extremely useful software!

Our RDP taxonomy outline is what the SILVA folks pulled out of the RDP. The actual RDP training set is available, but we opted for the SILVA because we were able to make it more comprehensive.

Thanks for the fast reply! So, the silva.rdp.taxonomy mapping file is just the results of searching the SILVA subset against the RDP classifier? I’m interested in trying a number of different 16S classification databases, but I’d like to have them all mapped to the same hierarchy if possible (for comparison). But, perhaps this is a misinformed effort on my part?

Not exactly - I got the RDP taxonomy outline from the appropriate field in the SILVA-provided ARB database. They also provide the greengenes, NCBI/EMBL, and SILVA taxonomies for each sequence.

Thanks Pat! I appreciate your helpfulness. I’ll just try both rdp.taxonomy and slv.taxonomy, and if neither gives me an overabundance of “Bacteria;unclassified_Bacteria” I’ll run with the RDP hierarchy. Thanks again.

Hi, are there taxonomy files for the SILVA archaea and eukaryote fasta files that can be used with the classify.seqs command?

Yes - I just posted these and would appreciate any feedback people have about the completeness of these references.

Hi Pschloss, are there SILVA reference files based on SILVA 104? Another question: I want to improve the classification of my Archaea data, because I get lots of unclassified sequences now when I use the mothur-supplied SILVA template. Do you have any advice? I downloaded SILVA 104, but I don’t know how to use it.

I am wondering if I can align to the SILVA database and classify with greengenes, and if the classify.seqs command is the way to do this. I used the SILVA database earlier on when going through the 454 SOP, but when I enter the classify.seqs command, do I enter the gg_99 .fasta and .tax as the template and taxonomy files? This seems right to me, i.e., I’ve already aligned to SILVA at this point and I am just taking my data and classifying it against the Greengenes database, right? Just making sure. When using RDP training set 9, I get loads of unclassified sequences. I think more will be identified using Greengenes. Any suggestions? And thanks!