SILVA v123

Hopefully this is posted in the correct place, but I am wondering if there are plans for formatting the newest release of the SILVA reference database (v123, released a little over a month ago) for use in Mothur.

I ask this since I am at the University of Illinois using the IM-TORNADO pipeline which relies on Mothur and thus needs its formatting to utilize a reference database. I am very interested in using the most up-to-date information.

Thank you for your time!
Brett

This is on my list of things to do. Although, people are perfectly welcome to take the code I’ve posted for the previous version and run it for 123 :slight_smile:

http://blog.mothur.org/2014/08/08/SILVA-v119-reference-files/

Pat

Hi Pat, I am in the process of following your blog post and processing the SILVA v123 myself as per your instructions. However I have a question: how do you pick the positions for the 27f 1492r primers? I take it they are the result of aligning an e. coli primer trimmed (do you keep the primers inside or kick them out here?) to the exported fasta from ARB in the first step? However, this exported fasta is 25Gb large in the case of v123. This would seem like a computationally rather unpleasant task, but if there’s no other way I am willing to run it. Additionally: why the threshold at 5 ambiguous base calls? Is there a reason for that?

update
Upon running summary.seqs on the SILVA v123 I can see that 97.5% of the sequences have less than 5 ambiguous base calls. Hence I understand the choice for the base calls.

update 2
So I decided that since I have 32Gb of RAM anyway I could just as well run the align.seqs with the trimmed E. coli 16S sequence. I matched the primers with regex in notepad++ to the 16S of E. coli and cut out everything outside of the primer region. In this way the primers were still inside of the original sequence before aligning. Which resulted in alignment start at 1007 and end at 43242. However, this resulted in a NR .aling file that has a size of 2.9Gb, which is compared to release 119 a lot less than the 7.2Gb which we expect http://www.mothur.org/wiki/Silva_reference_files. This makes sense as from the initial 526361 sequences only 68555 were withheld after screening. This answered my second question and re-running with a trimmed coli without the primers got the numbers from the readme (1044 start and 43116 end). This still reduces the dataset to 198112 sequences (approx 40% of initial number), but seems to be more reasonable. Next I ran the commands to recreate the SEED alignment however here I have 14914 sequences rather than 15009 sequences of the v119. This surprises me a bit since I expect a newer release to have more sequences, but maybe the 100% quality scores to the SEED of silva got fewer than before? But then I get into real trouble with the regex to calculate the number of different taxa at each taxonomic level. If I just run the R code I get:

>full.matrix
          phyla class order family genus n.seqs
Bacteria     36    71   157    335     0 465754
Archaea       9    19    18     34     0  21597
Eukaryota     9    22    39     78   106  39010

Which is very odd and unexpected. When I check what is going on, I find that sub.tax (here for bacteria) is looking good:

> head(sub.tax)
[1] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas"
[2] "Bacteria;Firmicutes;Bacilli;Bacillales;Paenibacillaceae;Paenibacillus"
[3] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Acetobacteraceae;Gluconacetobacter"
[4] "Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces"
[5] "Bacteria;Acidobacteria;Acidobacteria;Acidobacteriales;Acidobacteriaceae_(Subgroup_1);Edaphobacter"
[6] "Bacteria;Actinobacteria;Thermoleophilia;Solirubrobacterales;Solirubrobacteraceae;Solirubrobacter"

However, when you would look at phylum level you get a lot of phyla that still contain “bacteria” but also a phylum name

>  as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
 [1] "Acetothermia;"
 [2] "Acidobacteria;"
 [3] "Actinobacteria;"
 [4] "Aquificae;"
 [5] "Armatimonadetes;"
 [6] "Atribacteria;"
 [7] "Bacteria;Acetothermia"
 [8] "Bacteria;Actinobacteria"
 [9] "Bacteria;Aerophobetes"
(... output truncated...)
[41] "Bacteria;Saccharibacteria"
[42] "Bacteria;SBYG-2791"
[43] "Bacteria;SHA-109"
[44] "Bacteria;SM2F11"
[45] "Bacteria;TA06"
[46] "Bacteria;TM6"
[47] "Bacteria;WCHB1-60"
[48] "Bacteria;WD272"
(... output truncated...)
[74] "Synergistetes;"
[75] "Tenericutes;"
[76] "Thermodesulfobacteria;"
[77] "Thermotogae;"
[78] "Verrucomicrobia;"

So I’m guessing something’s off with the regex, though I just cannot find out exactly what?
All input is, as usual, warmly welcome.

Update 3
Ok, I’m the inpatient type :wink:
I changed some of the R code using the function

cSplit

in the package “splitstackshape”. I think with base R’s

strsplit

it should be possible to however a bit more work would be required. I also had to adapt the Eukaryota levels because for one reason or another the initial grepping did not work and resulted in 7 instead of 6 levels.

library(splitstackshape)
getNumTaxaNames <- function(file, kingdom){
 taxonomy <- read.table(file=file, row.names=1)
 sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])
 sub.tax.spl<-cSplit(as.data.frame(sub.tax),1,sep=";")
 if(kingdom!="Eukaryota")
 {
  setnames(sub.tax.spl,old=colnames(sub.tax.spl),new=c("Regnum","Phylum","Classis","Ordo","Familia","Genus"))
 }else{
  setnames(sub.tax.spl,old=colnames(sub.tax.spl),new=c("Regnum","Supergroup","Phylum","Classis","Ordo","Familia","Genus"))
 }
 phyla <- nlevels(sub.tax.spl$Phylum)
 classis <- nlevels(sub.tax.spl$Classis) 
 ordo <- nlevels(sub.tax.spl$Ordo)
 familia <- nlevels(sub.tax.spl$Familia)
 genus <- nlevels(sub.tax.spl$Genus)
 
 n.seqs <- length(sub.tax)
 return(c(phyla=phyla, classis=classis, ordo=ordo, familia=familia, genus=genus, n.seqs=n.seqs))
}

This resulted in the following result for the full.matrix, which still looks a bit “off” for the Eukaryota:

> full.matrix
          phyla class order family genus n.seqs
Bacteria     66   170   322    655  2554 465754
Archaea      20    47    37     68   139  21597
Eukaryota    30    83   175    475   492  39010

My nr and seed matrices look like this:

> nr.matrix
          phyla class order family genus n.seqs
Bacteria     64   160   311    623  2447 152308
Archaea      17    31    30     58   135   3901
Eukaryota    24    75   131    399   356  16209
> seed.matrix
          phyla class order family genus n.seqs
Bacteria     52   102   194    388  1723  12083
Archaea       7    13    18     30    82    294
Eukaryota    15    25    47     86    72   2537

Finally the matrix ratio’s are:

> nr.matrix / full.matrix
             phyla     class     order    family     genus    n.seqs
Bacteria  0.969697 0.9411765 0.9658385 0.9511450 0.9581049 0.3270138
Archaea   0.850000 0.6595745 0.8108108 0.8529412 0.9712230 0.1806269
Eukaryota 0.800000 0.9036145 0.7485714 0.8400000 0.7235772 0.4155088
> seed.matrix / full.matrix
              phyla     class     order    family     genus     n.seqs
Bacteria  0.7878788 0.6000000 0.6024845 0.5923664 0.6746280 0.02594288
Archaea   0.3500000 0.2765957 0.4864865 0.4411765 0.5899281 0.01361300
Eukaryota 0.5000000 0.3012048 0.2685714 0.1810526 0.1463415 0.06503461

So overall I think I did quite okay except for the Eukaryota in this case. I am also still wondering why the SEED has less sequences than the v119 SEED.
I only work with bacteria and I do think the taxonomies are independent of this regex filtering. I will tar the archives and make them available, as soon as I can replicate the README file with my work, so I just need to find a way to properly filter the Eukaryota.

Kind regards,

FM

Hi all,

If anyone wants to use the v123 release of SILVA, I followed Pat’s instructions and formatted it for you.
I am still a bit doubtful on the tables for Eukaryotic taxonomy (see above) but overall I think I followed the instructions and adapted them where necessary.
SILVA apparently admitted on the 22nd that there were issues with their Eukaryotic taxonomy, and updated some files but the SSUref arb doesn’t appear to be updated?
I was also worried that the SEED got smaller (?) but apparently this is indeed the case according to SILVA themselves:
(from Release 123)

The SSU Seed was extended with latest LTP version (121).
Cleaning up the Seed has resulted in the removal of 1986 sequences.

My colleague Tim has helped me in hosting it on our Lab’s website (I’m not a hero in hosting stuff):
http://www.labmet.ugent.be/content/silva-v123-reference-files
he also had some fun with a combined silva/mothur logo on our frontpage (if you have the time to go see it, I think it’s a nice touch) :lol:

We are about to send out a new batch of 400 samples for MiSeq with V3-V4 primers and will use the v123 for alignment (SEED?) and classification (nr). I will keep you all posted if anything weird would pop up. I will also compare the v119 with the v123 on our in-house mock community.

Kind regards,

FM

Hello

I am trying to figure out how to make protozoa survey using MiSeq sequencing.

Are protozoa present in this release of Silva?

Many thanks!

Hi Alex,

None of the users of this forum is a SILVA maintainer and I suppose most of us work on bacteria or archaea.

If you want to see if SILVA contains protozoal SSU sequences (which I suppose it does) I do recommend that you look for yourself (instead of asking here).
It took me exactly three seconds :wink: to google some protozoal taxonomy (what a mess by the way, good luck! :? ) and find in the SILVA browser (http://www.arb-silva.de/browser/) that there are representatives of Amoebozoa, Archaeaplastida and Excavata in there.

Just look for the genera you are interested in using SILVA’s online tools such as the search function: http://www.arb-silva.de/search/.

If they are not in there but you have your own curated sequence database remember that you can always make you own mothur taxonomy outlines or extend the existing ones. If you do so, don’t forget to share them with the community :slight_smile:

Good luck,

FM

Oh thx for the browser, I did not noticed it.

If anybody else is into protozoa community analysis, I would be please to make contact with you.

I am more of a user then a developper…

I also have a hard time with the nomenclature, seems not everyone is using the same.

I have found theses ressources if anyone is interested:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=2759&lvl=3&lin=f&keep=1&srchmode=1&unlock

http://www.organismnames.com/query.htm?searchType=tree&pp=10&so=a0&q=Protozoa

So I relooked at my in silico pcr in release 119 and yes there is come protozoa in my results. I was just going too fast and looking for “protozoa” but I was not using the correct classification. There is plenty of protozoa, for exemple alveolata results.

Case closed (for now). Thanks for pointing me in the right direction. I will see what I can find in the 123 release.

Hi Alex, for you and other users that want to get a quick overview of the taxonomy include in the current silva release, I played around a bit with Krona (https://github.com/marbl/Krona/wiki) and now added Krona visualizations of the SEED and NR release 123 to our mothur hosting page under the taxonomic representation part: http://labmet.ugent.be/content/silva-v123-reference-files#taxonomic-representation.
Have fun with it!

Kind regards,

FM

Nice work, thanks!

FM,
Is there a way to get the Eukaryota taxonomy to extend farther down? Right now in your SILVA v.123, the taxonomy for the fungal records is truncated at a very coarse phylogenetic resolution (e.g., Basidiomycota, Ascomycota, etc).

Dear all,

Sorry I only get back to you now on this but we are experiencing some issues with the silva v123 release that I tried to format for mothur. I was in the lab for the last two months but now it is analysis time and we see weird things happening with our mock community.
When we align using the silva.seed v123 that I generated using the instructions on the mothur blog and classify with the RDP trainset 14 we saw that specific OTU’s are “switched”, i.e. their taxonomy is reversed when comparing the taxonomy with the NCBI BLAST result of the get.oturep output.
I am very limited on time but I will try to figure out what is going wrong this weekend.
If anybody else on this forum has an idea, please feel free to pitch in.

Kind regards,

FM

Hi,

As a follow up on the previous message I have some updates.
For one, we tried to replicate the issue with the release 119 SEED as an alignment database which appeared to be perfectly okay.
This is something that worries us a lot because the alignment shouldn’t impact the classification?
In both cases we used the RDP14 in classify.seqs and the classify.OTU afterwards. The BLAST results were obtained by blasting the output of get.oturep.
We will keep you posted if we are able to resolve the issue but as always, any and all input is welcome.

edit we also had the same issue with the SILVA v123 NR (not only the seed)

Kind regards,

FM

I saw that the SILVA v123 is now added to the mothur.org wiki.
I can see that I must have made some mistakes while splitting the taxonomies using cSplit from the splitstackshape package.
Thanks for fixing this!

Kind regards.

FM

From July there is a warning in the SILVA section of the 16S Wikipedia page (16S ribosomal RNA - Wikipedia) stating:

the latest version with taxonomies, SILVA_123.1_SSURef_Nr99_tax, contains errors in the taxonomy because the species name is taken from the source and the taxonomy, up to genus, is found by similarity, e.g. like a pseudomonas lineage up to genus with a cricket species: Pseudomonas;Teleogryllus commodus.

It seems an “anonymous” contribution and I cannot find other references about this issue. Has someone experienced this supposed problem with SILVA v123 taxonomy?
Many thanks.