Silva 132?

Hello,

I was wondering if there are plans to offer the mothur formatted silva files for the new 132 release in the near future?

Thanks!

I have been working on mothur formatted silva files from Silva 132 using my primers. I have been following Pat Schloss’ workflow in the README.md for doing so. These questions pertain to that, so I will copy/paste the quotes from which my questions are derived.

“The mothur commands above do several things. First the screen.seqs command removes sequences that are not full length and have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB RN reference database lets in shorter (>900 bp) archaeal 16S rRNA gene sequences.”

“The Archaea take a beating and recall they lost a bunch of sequences in the initial steps since many of the arachaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modify the screen.seqs and pcr.seqs steps to target your region of interest.”

In my understanding “screen.seqs” here is screening all seqs that are outside of the region defined in the silva.full_v132.fasta by your oligos.
Q: If that is the case, isn’t it your primers not screen.seqs that is ultimately causing the removal of the archaeal sequences?

I ask because I am interested in both bacteria and archaea.

Thanks in advance.

Yeah, it’s on the list of things to do. Bug me again if it isn’t done in a month.

Pat

Ok, thank you! My attempt isn’t going as well as I’d hoped. There are some issues with polaribacter. In the silva txt file there are polaribacter and polaribacter 1-5 which doesn’t match up well with the .full file, as the naming of polaribacter in the .arb file doesn’t differentiate between these, giving you NAs in the R workflow for formatting the .tax file. Hopefully that was clear, and can help when you’re making the files.

With the silva 128 files you provide already, which is to be used as the reference in classify seqs? It seems like we could use the degapped fasta made during the formatting of the silva files, otherwise I’m unsure what should be used.

Thank you

you can use the aligned or degapped version of the sequence files.

I think I might have written a successful script which is modified from your blog for making mothur-formatted Silva files with Silva 132, based on my primers and the inherent problems with formatting Silva 132 in R. It starts from ARB and ends in R.

###### Silva 132 ##########
# You will need :
# ARB for mothur formatted fasta
# tax_slv_ssu_132.txt for creating taxonomy file
# for additional info on each step see: http://blog.mothur.org/2017/03/22/SILVA-v128-reference-files/


####setting up ARB on mac laptop#######

curl -N https://www.arb-silva.de/fileadmin/arb_web_db/release_132/ARB_files/SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz

gunzip SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz


#install arb on mac; supercomp too hard
sudo port selfupdate
sudo port upgrade outdated
sudo port install arb


#Fetching archive for openmotif
#Error: Failed to archivefetch openmotif: xorg-libXt must be installed with +flat_namespace.
sudo port install xorg-libXt +flat_namespace
sudo port install arb

#open new terminal
arb_macsetup #set ARBHOME and ARB to PATH
#Your shell already has the right PATH environment variable for use with MacPorts!
#Backing up your /Users/alexiswalker/.bash_profile shell confguration file as /Users/alexiswalker/.bash_profile.macports-saved_2017-12-19_at_15:48:47 before adapting it for ARB.
#An appropriate PATH variable has been added to your shell environment by the ARB installer.
#An appropriate ARBHOME variable has been added to your shell environment by the ARB installer.
#You have succesfully installed ARB
#Open a new terminal window and type arb to launch ARB
# Important notes can be viewed at any time by typing 'port notes arb'
#Please cite: Wolfgang Ludwig, et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Research. 32:1363-1371

arb SILVA_132_SSURef_NR99_13_12_17_opt.arb

#Within arb do the following:

#1.  Click the search button
#2.  Set the first search field to 'ARB\_color' and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)
#3.  Click 'Search'. This yielded 577,832 hits
#4.  Click the "Mark Listed Unmark Rest" button
#5.  Close the "Search and Query" box
#6.  Now click on File-\>export-\>export to external format
#7.  In this box the `Export` option should be set to `marked`, `Filter` to `none`, and `Compression` should be set to `no`.
#8.  In the field for `Choose an output file name enter` make sure the path has you in the correct working directory and enter `silva.full_v132.fasta`.
#9.  Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequences accession number and it's taxonomy across the top line. To create one for you will need to create `fasta_mothur.eft` in the `/opt/local/share/arb/lib/export/` folder with the following:

silva.full_v132.fasta

cd /opt/local/share/arb/lib/export
sudo nano fasta_mothur.eft

SUFFIX          fasta
BEGIN
>*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);
*(|export_sequence)

#  You can now quit arb.

####### Create align and tax files ########
wget http://ftp.arb-silva.de/release_132/Exports/taxonomy/tax_slv_ssu_132.txt #get silva reference txt

#I noticed there is a discrepency with Polaribacter naming between silva.full_v132.fasta & tax_slv_ssu_132.txt
# silva txt - Polaribacter_1 ; silva fasta - Polaribacter 1
#find and replace
sed -i 's/Polaribacter 1/Polaribacter_1/g' silva.full_v132.fasta &
sed -i 's/Polaribacter 2/Polaribacter_2/g' silva.full_v132.fasta &
sed -i 's/Polaribacter 3/Polaribacter_3/g' silva.full_v132.fasta &
sed -i 's/Polaribacter 4/Polaribacter_4/g' silva.full_v132.fasta &
sed -i 's/Polaribacter;Polaribacter;/Polaribacter;/g' silva.full_v132.fasta &
sed -i 's/Polaribacter;Polaribacter_3;/Polaribacter_3;/g' silva.full_v132.fasta &
sed -i 's/[][]//g' silva.full_v132.fasta &

grep -nr "Polaribacter" silva.full_v132.fasta > Polaribacter.txt & #check to make sure successful

# create align file with my primers

#my primers
nano BA.oligos
forward GTGYCAGCMGCCGCGGTAA
Reverse GGACTACNVGGGTWTCTAAT

mothur "#pcr.seqs(fasta=silva.full_v132.fasta, oligos=BA_oligos, processors=24)" # find where V4 starts and ends
#Output File Names:
#silva.full_v132.pcr.fasta
#silva.full_v132.bad.accnos
#silva.full_v132.scrap.pcr.fasta
#It took 220 secs to screen 629211 sequences.

mothur
degap.seqs(fasta=silva.full_v132.pcr.fasta, processors=10)
#Output File Names:
#silva.full_v132.pcr.ng.fasta
#It took 53 secs to degap 511899 sequences

unique.seqs(fasta=silva.full_v132.pcr.ng.fasta)
#Output File Names:
#silva.full_v132.pcr.ng.names
#silva.full_v132.pcr.ng.unique.fasta

head silva.full_v132.pcr.ng.unique.fasta
# AC201870.A6BReg23
# TACGGAGGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCATGTA

grep ">" silva.full_v132.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v132.pcr.ng.unique.accnos

#head silva.full_v132.pcr.ng.unique.accnos
#AC201870.A6BReg23

mothur
get.seqs(fasta=silva.full_v132.pcr.fasta, accnos=silva.full_v132.pcr.ng.unique.accnos)
#Output File Names:
#silva.full_v132.pcr.pick.fasta
#Selected 231567 sequences from your fasta file.

#example of what resulting .align file should look like:
# >GCVF01000095.Th8Rotu4 91.46 Bacteria;Bacteroidetes;Cytophagia;Cytophagales;Flammeovirgaceae;Roseivirga;
#...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................#...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................#...............................................A-T--G-A--A-C-G-C--T-A-G-C--G-G--C-A-G-G----------C----C-T--AA-T--AC-A----T-G--C----A-A-G--T-C--GA-A-CG-----------G-TAG--G-T--T------------------------------------T---C-C------------------

head silva.full_v132.pcr.pick.fasta
# JQ771129.FYONec10 fpdiffs=0(match) rpdiffs=0(match)   95.59 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;
#through the process above "fpdiffs=0(match)" "rpdiffs=0(match)" was added to file
# remove fpdiffs=0(match) rpdiffs=0(match) and create align file
cut -f 1,4,5 silva.full_v132.pcr.pick.fasta > silva.nr_v132.align

#generate taxonomy file
grep '>' silva.nr_v132.align | cut -f1,3 | cut -f2 -d'>' > silva.nr_v132.taxfull


#example of what resulting taxonomy file should look like:
# GCVF01003767.Th8Rotu3 Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;NS11-12_marine_group;NS11-12_marine_group_ge;
# GCVF01000095.Th8Rotu4 Bacteria;Bacteroidetes;Cytophagia;Cytophagales;Flammeovirgaceae;Roseivirga;

head silva.nr_v132.taxfull
# AC201870.A6BReg23 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus Regiella;
# JQ765428.GH9Dis21 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;


#there was an issue with Escherichia, I think it was Escherichia-Shigella, in silva 128, lets see if still there in 132:
# still there let's just replace now in the align and tax files
grep -nr "Escherichia" tax_slv_ssu_132.txt &
sed -i 's/Escherichia-Shigella/Escherichia/g' tax_slv_ssu_132.txt &

grep -nr "Escherichia" silva.nr_v132.align &
sed -i 's/Escherichia-Shigella/Escherichia/g' silva.nr_v132.align &


grep -nr "Escherichia" silva.nr_v132.taxfull &
sed -i 's/Escherichia-Shigella/Escherichia/g' silva.nr_v132.taxfull &

#also need to make sure that Polaribacter is same in tax_slv_ssu_132.txt
grep -nr "Polaribacter" tax_slv_ssu_132.txt
# still there; replace
sed -i 's/Polaribacter 1/Polaribacter_1/g' tax_slv_ssu_132.txt
sed -i 's/Polaribacter 2/Polaribacter_2/g' tax_slv_ssu_132.txt
sed -i 's/Polaribacter 3/Polaribacter_3/g' tax_slv_ssu_132.txt
sed -i 's/Polaribacter 4/Polaribacter_4/g' tax_slv_ssu_132.txt
sed -i 's/Polaribacter;Polaribacter;/Polaribacter;/g' tax_slv_ssu_132.txt
sed -i 's/Polaribacter;Polaribacter_3;/Polaribacter_3;/g' tax_slv_ssu_132.txt
sed -i 's/[][]//g' tax_slv_ssu_132.txt

#looks good

######## R reformatting ############

map.in <- read.table("tax_slv_ssu_132.txt",header=F,sep="\t",stringsAsFactors=F)
map.in <- map.in[,c(1,3)]
head(map.in)
#                                    V1     V3
#1                              Archaea; domain
#2                Archaea;Altiarchaeota; phylum
#3   Archaea;Altiarchaeota;Altiarchaeia;  class

colnames(map.in) <- c("taxlabel","taxlevel")
map.in <- rbind(map.in, c("Bacteria;RsaHf231;", "phylum")) #wasn't in tax_slv_ssu_132.txt

map.in$taxlabel <- gsub("[[:space:]]+;", ";", map.in$taxlabel) #fix extra space in "Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Dothideomycetes;Pleosporales;Phaeosphaeriaceae;Parastagonospora ;"
map.in$taxlabel <- gsub(";[[:space:]]+$", ";", map.in$taxlabel)

taxlevels <- c("root","domain","major_clade","superkingdom","kingdom","subkingdom","infrakingdom","superphylum","phylum","subphylum","infraphylum","superclass","class","subclass","infraclass","superorder","order","suborder","superfamily","family","subfamily","genus")
taxabb <- c("ro","do","mc","pk","ki","bk","ik","pp","ph","bp","ip","pc","cl","bc","ic","po","or","bo","pf","fa","bf","ge")

tax.mat <- matrix(data="",nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] <- "root"
colnames(tax.mat) <- taxlevels

head(tax.mat)
#root   domain major_clade superkingdom kingdom subkingdom infrakingdom
#[1,] "root" ""     ""          ""           ""      ""         ""

outlevels <- c("domain","phylum","class","order","family","genus")

for (i in 1:nrow(map.in)) {
 taxname <- unlist(strsplit(as.character(map.in[i,1]), split=';'))
 #print(taxname);

  while ( length(taxname) > 0) {
  #regex to look for exact match

    tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
  tax.match <- match(tax.exp,map.in$taxlabel)
  tax.mat[i,map.in[tax.match,2]] <- tail(taxname,1)
  taxname <- head(taxname,-1)
  }
}

for (i in 1:nrow(tax.mat)) {
 #this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
 #if you don't want this behavior, cut it out

  for (j in 1:ncol(tax.mat)) {
  if(tax.mat[i,j] < 0) { tax.mat[i,j] <- paste(tmptax,taxabb[j],sep="_")}
  else { tmptax <- tax.mat[i,j]}
  }

  #this maps the new name to the input taxonomic levels
  map.in[i,"taxout"] <- paste(paste(tax.mat[i,outlevels],collapse=";"),";",sep="")
}

# replace spaces with underscores
map.in$taxout <- gsub(" ","_",map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in <- read.table("silva.nr_v132.taxfull",header=F,stringsAsFactors=F,sep="\t")
colnames(tax.in) <- c("taxid","taxlabel")

tax.in$taxlabel <- gsub("[[:space:]]+;", ";", tax.in$taxlabel) #fix extra space in "Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Dothideomycetes;Pleosporales;Phaeosphaeriaceae;Parastagonospora ;"
tax.in$taxlabel <- gsub(";[[:space:]]+$", ";", tax.in$taxlabel)

tax.in$id <- 1:nrow(tax.in)

tax.write <- merge(tax.in,map.in,all.x=T,sort=F)
tax.write <- tax.write[order(tax.write$id),]

getDepth <- function(taxonString){
  initial <- nchar(taxonString)
  removed <- nchar(gsub(";", "", taxonString))
  return(initial-removed)

}

depth <- getDepth(tax.write$taxout)
summary(depth) #should all be 6

summary(depth) #should all be 6
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#      6       6       6       6       6       6      10

#If you have NAs you need to find which taxid and tax label has the NA and fix it, preferably before R and rerun R formatting
#since you have to reformat all files anyway

NA_tax.write <- tax.write[rowSums(is.na(tax.write)) > 0,]# find taxid and label w/ NA

#once fixed
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      6       6       6       6       6       6

bacteria <- grepl("Bacteria;", tax.write$taxout)
archaea <- grepl("Archaea;", tax.write$taxout)
eukarya <- grepl("Eukaryota;", tax.write$taxout)

tax.write[depth > 6 & bacteria,] #good to go
tax.write[depth > 6 & archaea,]  #good to go
tax.write[depth > 6 & eukarya,]  #good to go

#you want 0 rows greater than 6

write.table(tax.write[,c("taxid","taxout")],file="silva.full_v132.tax",sep="\t",row.names=F,quote=F,col.names=F)


########get counts for each taxa/level
getNumTaxaNames <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
  sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
  phyla <- sum(!grepl(kingdom, phyla))

  class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  class <- sum(!grepl(kingdom, class))

  order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  order <- sum(!grepl(kingdom, order))

  family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  family <- sum(!grepl(kingdom, family))

  genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  genus <- sum(!grepl(kingdom, genus))

  n.seqs <- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")

nr.file <- "silva.full_v132.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix

#phyla class order family genus n.seqs
#Bacteria     78   203   546   1026  4018 218440
#Archaea      10    28    48     85   214  13127
#Eukaryota     0     0     0      0     0      0

It’s up now…

http://blog.mothur.org/2018/01/10/SILVA-v132-reference.files/