Creating a 23S Database


I am attempting to make a custom 23S database, and I am having the hardest time getting it to work. I have been following the instructions listed here: (Currently using Silva v132 to make it comparable to a previous 16S analysis).

After running the fasta file through ARB, I ran the following:

$ mothur “#screen.seqs(fasta=silva.full_v132_LSU.fasta, start=115034, end=119675, maxambig=5, processors=8);
pcr.seqs(start=115034, end=119675, keepdots=T);

#identify the unique sequences without regard to their alignment
$ grep “>” | cut -f 1 | cut -c 2- >

#get the unique sequences without regard to their alignment
$ mothur “#get.seqs(fasta=silva.full_v132_LSU.good.pcr.fasta,”

#generate alignment file
$ mv silva.full_v132_LSU.good.pcr.pick.fasta silva.nr_v132_LSU.align

This file, which ends up being the reference file used in align.seqs, looks like this:

KT989341.GycFall2 72.91 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Annelida;Polychaeta;

CP016896.FYOASo94 99.72 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;

The whole string is …s; there are no letters anywhere (from looking at my 16S one, I have a feeling this might not be right).

#generate taxonomy file
$ grep ‘>’ silva.nr_v132_LSU.align | cut -f1,3 | cut -f2 -d’>’ > silva.nr_v132_LSU.full

This .full file appears to just be a list of names? And so I am not sure if that is correct, but it is the file used to make the .tax file. The file looks like this:


Then in R, to make the taxonomy file: <- read.table(“tax_slv_lsu_132.txt”,header=F,sep="\t",stringsAsFactors=F) <-[,c(1,3)]
colnames( <- c(“taxlabel”,“taxlevel”) <- rbind(, c(“Bacteria;RsaHf231;”, “phylum”)) #wasn’t in tax_slv_ssu_138.txt

#fix Escherichia nonsense$taxlevel[which($taxlabel==“Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;”)] <- “genus”

taxlevels <- c(“root”,“domain”,“major_clade”,“superkingdom”,“kingdom”,“subkingdom”,“infrakingdom”,“superphylum”,“phylum”,“subphylum”,“infraphylum”,“superclass”,“class”,“subclass”,“infraclass”,“superorder”,“order”,“suborder”,“superfamily”,“family”,“subfamily”,“genus”)
taxabb <- c(“ro”,“do”,“mc”,“pk”,“ki”,“bk”,“ik”,“pp”,“ph”,“bp”,“ip”,“pc”,“cl”,“bc”,“ic”,“po”,“or”,“bo”,“pf”,“fa”,“bf”,“ge”)
tax.mat <- matrix(data="",nrow=nrow(,ncol=length(taxlevels))
tax.mat[,1] <- “root”
colnames(tax.mat) <- taxlevels

outlevels <- c(“domain”,“phylum”,“class”,“order”,“family”,“genus”)

for(i in 1:nrow( {
taxname <- unlist(strsplit(as.character([i,1]), split=’;’))

while ( length(taxname) > 0) {
#regex to look for exact match

tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
tax.match <- match(tax.exp,$taxlabel)
tax.mat[i,[tax.match,2]] <- tail(taxname,1)
taxname <- head(taxname,-1)


for(i in 1:nrow(tax.mat)) {
#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
#if you don’t want this behavior, cut it out
for(j in 1:ncol(tax.mat)) {
if(tax.mat[i,j] < 0) { tax.mat[i,j] <- paste(tmptax,taxabb[j],sep="_")}
else { tmptax <- tax.mat[i,j]}

#this maps the new name to the input taxonomic levels[i,“taxout”] <- paste(paste(tax.mat[i,outlevels],collapse=";"),";",sep="")

replace spaces with underscores$taxout <- gsub(" “,”_",$taxout)

bring in the old taxonomic levels from SILVA and remap them using the new levels <- read.table(“silva.nr_v132_LSU.full”,header=F,stringsAsFactors=F,sep="\t")
colnames( <- c(“taxid”,“taxlabel”)

Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem$taxlabel <- gsub(“Polaribacter;Polaribacter;”, “Polaribacter;”,$taxlabel)$taxlabel <- gsub(";[[:space:]]+$", “;”,$taxlabel)$id <- 1:nrow(

tax.write <- merge(,,all.x=T,sort=F)
tax.write <- tax.write[order(tax.write$id),]

#we want to see whether everything has 6 taxonomic level (kingdom to genus)
getDepth <- function(taxonString){
initial <- nchar(taxonString)
removed <- nchar(gsub(";", “”, taxonString))

depth <- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria <- grepl(“Bacteria;”, tax.write$taxout)
archaea <- grepl(“Archaea;”, tax.write$taxout)
eukarya <- grepl(“Eukaryota;”, tax.write$taxout)

tax.write[depth > 6 & bacteria,] #if zero, we’re good to go
tax.write[depth > 6 & archaea,] #if zero, we’re good to go
tax.write[depth > 6 & eukarya,] #if zero, we’re good to go


At the end, I get a taxonomy file that has NAs in it:

KT989341.GycFall2 NA
CP016896.FYOASo94 NA
CP009575.GJRIva57 NA

I have no idea what I am doing wrong, and I would appreciate any help! Thanks!