Hi Pat, I am in the process of following your blog post and processing the SILVA v123 myself as per your instructions. However I have a question: how do you pick the positions for the 27f 1492r primers? I take it they are the result of aligning an e. coli primer trimmed (do you keep the primers inside or kick them out here?) to the exported fasta from ARB in the first step? However, this exported fasta is 25Gb large in the case of v123. This would seem like a computationally rather unpleasant task, but if there’s no other way I am willing to run it. Additionally: why the threshold at 5 ambiguous base calls? Is there a reason for that?
update
Upon running summary.seqs on the SILVA v123 I can see that 97.5% of the sequences have less than 5 ambiguous base calls. Hence I understand the choice for the base calls.
update 2
So I decided that since I have 32Gb of RAM anyway I could just as well run the align.seqs with the trimmed E. coli 16S sequence. I matched the primers with regex in notepad++ to the 16S of E. coli and cut out everything outside of the primer region. In this way the primers were still inside of the original sequence before aligning. Which resulted in alignment start at 1007 and end at 43242. However, this resulted in a NR .aling file that has a size of 2.9Gb, which is compared to release 119 a lot less than the 7.2Gb which we expect http://www.mothur.org/wiki/Silva_reference_files. This makes sense as from the initial 526361 sequences only 68555 were withheld after screening. This answered my second question and re-running with a trimmed coli without the primers got the numbers from the readme (1044 start and 43116 end). This still reduces the dataset to 198112 sequences (approx 40% of initial number), but seems to be more reasonable. Next I ran the commands to recreate the SEED alignment however here I have 14914 sequences rather than 15009 sequences of the v119. This surprises me a bit since I expect a newer release to have more sequences, but maybe the 100% quality scores to the SEED of silva got fewer than before? But then I get into real trouble with the regex to calculate the number of different taxa at each taxonomic level. If I just run the R code I get:
>full.matrix
phyla class order family genus n.seqs
Bacteria 36 71 157 335 0 465754
Archaea 9 19 18 34 0 21597
Eukaryota 9 22 39 78 106 39010
Which is very odd and unexpected. When I check what is going on, I find that sub.tax (here for bacteria) is looking good:
> head(sub.tax)
[1] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas"
[2] "Bacteria;Firmicutes;Bacilli;Bacillales;Paenibacillaceae;Paenibacillus"
[3] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Acetobacteraceae;Gluconacetobacter"
[4] "Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces"
[5] "Bacteria;Acidobacteria;Acidobacteria;Acidobacteriales;Acidobacteriaceae_(Subgroup_1);Edaphobacter"
[6] "Bacteria;Actinobacteria;Thermoleophilia;Solirubrobacterales;Solirubrobacteraceae;Solirubrobacter"
However, when you would look at phylum level you get a lot of phyla that still contain “bacteria” but also a phylum name
> as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
[1] "Acetothermia;"
[2] "Acidobacteria;"
[3] "Actinobacteria;"
[4] "Aquificae;"
[5] "Armatimonadetes;"
[6] "Atribacteria;"
[7] "Bacteria;Acetothermia"
[8] "Bacteria;Actinobacteria"
[9] "Bacteria;Aerophobetes"
(... output truncated...)
[41] "Bacteria;Saccharibacteria"
[42] "Bacteria;SBYG-2791"
[43] "Bacteria;SHA-109"
[44] "Bacteria;SM2F11"
[45] "Bacteria;TA06"
[46] "Bacteria;TM6"
[47] "Bacteria;WCHB1-60"
[48] "Bacteria;WD272"
(... output truncated...)
[74] "Synergistetes;"
[75] "Tenericutes;"
[76] "Thermodesulfobacteria;"
[77] "Thermotogae;"
[78] "Verrucomicrobia;"
So I’m guessing something’s off with the regex, though I just cannot find out exactly what?
All input is, as usual, warmly welcome.
Update 3
Ok, I’m the inpatient type
I changed some of the R code using the function
cSplit
in the package “splitstackshape”. I think with base R’s
strsplit
it should be possible to however a bit more work would be required. I also had to adapt the Eukaryota levels because for one reason or another the initial grepping did not work and resulted in 7 instead of 6 levels.
library(splitstackshape)
getNumTaxaNames <- function(file, kingdom){
taxonomy <- read.table(file=file, row.names=1)
sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])
sub.tax.spl<-cSplit(as.data.frame(sub.tax),1,sep=";")
if(kingdom!="Eukaryota")
{
setnames(sub.tax.spl,old=colnames(sub.tax.spl),new=c("Regnum","Phylum","Classis","Ordo","Familia","Genus"))
}else{
setnames(sub.tax.spl,old=colnames(sub.tax.spl),new=c("Regnum","Supergroup","Phylum","Classis","Ordo","Familia","Genus"))
}
phyla <- nlevels(sub.tax.spl$Phylum)
classis <- nlevels(sub.tax.spl$Classis)
ordo <- nlevels(sub.tax.spl$Ordo)
familia <- nlevels(sub.tax.spl$Familia)
genus <- nlevels(sub.tax.spl$Genus)
n.seqs <- length(sub.tax)
return(c(phyla=phyla, classis=classis, ordo=ordo, familia=familia, genus=genus, n.seqs=n.seqs))
}
This resulted in the following result for the full.matrix, which still looks a bit “off” for the Eukaryota:
> full.matrix
phyla class order family genus n.seqs
Bacteria 66 170 322 655 2554 465754
Archaea 20 47 37 68 139 21597
Eukaryota 30 83 175 475 492 39010
My nr and seed matrices look like this:
> nr.matrix
phyla class order family genus n.seqs
Bacteria 64 160 311 623 2447 152308
Archaea 17 31 30 58 135 3901
Eukaryota 24 75 131 399 356 16209
> seed.matrix
phyla class order family genus n.seqs
Bacteria 52 102 194 388 1723 12083
Archaea 7 13 18 30 82 294
Eukaryota 15 25 47 86 72 2537
Finally the matrix ratio’s are:
> nr.matrix / full.matrix
phyla class order family genus n.seqs
Bacteria 0.969697 0.9411765 0.9658385 0.9511450 0.9581049 0.3270138
Archaea 0.850000 0.6595745 0.8108108 0.8529412 0.9712230 0.1806269
Eukaryota 0.800000 0.9036145 0.7485714 0.8400000 0.7235772 0.4155088
> seed.matrix / full.matrix
phyla class order family genus n.seqs
Bacteria 0.7878788 0.6000000 0.6024845 0.5923664 0.6746280 0.02594288
Archaea 0.3500000 0.2765957 0.4864865 0.4411765 0.5899281 0.01361300
Eukaryota 0.5000000 0.3012048 0.2685714 0.1810526 0.1463415 0.06503461
So overall I think I did quite okay except for the Eukaryota in this case. I am also still wondering why the SEED has less sequences than the v119 SEED.
I only work with bacteria and I do think the taxonomies are independent of this regex filtering. I will tar the archives and make them available, as soon as I can replicate the README file with my work, so I just need to find a way to properly filter the Eukaryota.
Kind regards,
FM