Creating a 23S Database

Hello,

I am attempting to make a custom 23S database, and I am having the hardest time getting it to work. I have been following the instructions listed here: https://mothur.org/blog/2020/SILVA-v138-reference-files/. (Currently using Silva v132 to make it comparable to a previous 16S analysis).

After running the fasta file through ARB, I ran the following:

$ mothur “#screen.seqs(fasta=silva.full_v132_LSU.fasta, start=115034, end=119675, maxambig=5, processors=8);
pcr.seqs(start=115034, end=119675, keepdots=T);
degap.seqs();
unique.seqs();”

#identify the unique sequences without regard to their alignment
$ grep “>” silva.full_v132_LSU.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v132_LSU.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
$ mothur “#get.seqs(fasta=silva.full_v132_LSU.good.pcr.fasta, accnos=silva.full_v132_LSU.good.pcr.ng.unique.accnos)”

#generate alignment file
$ mv silva.full_v132_LSU.good.pcr.pick.fasta silva.nr_v132_LSU.align

This file, which ends up being the reference file used in align.seqs, looks like this:

KT989341.GycFall2 72.91 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Annelida;Polychaeta;

CP016896.FYOASo94 99.72 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;

The whole string is …s; there are no letters anywhere (from looking at my 16S one, I have a feeling this might not be right).

#generate taxonomy file
$ grep ‘>’ silva.nr_v132_LSU.align | cut -f1,3 | cut -f2 -d’>’ > silva.nr_v132_LSU.full

This .full file appears to just be a list of names? And so I am not sure if that is correct, but it is the file used to make the .tax file. The file looks like this:

KT989341.GycFall2
CP016896.FYOASo94
CP009575.GJRIva57
CP012635.Esc13588

Then in R, to make the taxonomy file:

map.in ← read.table(“tax_slv_lsu_132.txt”,header=F,sep=“\t”,stringsAsFactors=F)
map.in ← map.in[,c(1,3)]
colnames(map.in) ← c(“taxlabel”,“taxlevel”)
map.in ← rbind(map.in, c(“Bacteria;RsaHf231;”, “phylum”)) #wasn’t in tax_slv_ssu_138.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel==“Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;”)] ← “genus”

taxlevels ← c(“root”,“domain”,“major_clade”,“superkingdom”,“kingdom”,“subkingdom”,“infrakingdom”,“superphylum”,“phylum”,“subphylum”,“infraphylum”,“superclass”,“class”,“subclass”,“infraclass”,“superorder”,“order”,“suborder”,“superfamily”,“family”,“subfamily”,“genus”)
taxabb ← c(“ro”,“do”,“mc”,“pk”,“ki”,“bk”,“ik”,“pp”,“ph”,“bp”,“ip”,“pc”,“cl”,“bc”,“ic”,“po”,“or”,“bo”,“pf”,“fa”,“bf”,“ge”)
tax.mat ← matrix(data=“”,nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] ← “root”
colnames(tax.mat) ← taxlevels

outlevels ← c(“domain”,“phylum”,“class”,“order”,“family”,“genus”)

for(i in 1:nrow(map.in)) {
taxname ← unlist(strsplit(as.character(map.in[i,1]), split=‘;’))
#print(taxname);

while ( length(taxname) > 0) {
#regex to look for exact match

tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
tax.match <- match(tax.exp,map.in$taxlabel)
tax.mat[i,map.in[tax.match,2]] <- tail(taxname,1)
taxname <- head(taxname,-1)

}
}

for(i in 1:nrow(tax.mat)) {
#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
#if you don’t want this behavior, cut it out
for(j in 1:ncol(tax.mat)) {
if(tax.mat[i,j] < 0) { tax.mat[i,j] ← paste(tmptax,taxabb[j],sep=“_”)}
else { tmptax ← tax.mat[i,j]}
}

#this maps the new name to the input taxonomic levels
map.in[i,“taxout”] ← paste(paste(tax.mat[i,outlevels],collapse=“;”),“;”,sep=“”)
}

replace spaces with underscores

map.in$taxout ← gsub(" “,”_",map.in$taxout)

bring in the old taxonomic levels from SILVA and remap them using the new levels

tax.in ← read.table(“silva.nr_v132_LSU.full”,header=F,stringsAsFactors=F,sep=“\t”)
colnames(tax.in) ← c(“taxid”,“taxlabel”)

Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem

tax.in$taxlabel ← gsub(“Polaribacter;Polaribacter;”, “Polaribacter;”, tax.in$taxlabel)
tax.in$taxlabel ← gsub(“;[[:space:]]+$”, “;”, tax.in$taxlabel)

tax.in$id ← 1:nrow(tax.in)

tax.write ← merge(tax.in,map.in,all.x=T,sort=F)
tax.write ← tax.write[order(tax.write$id),]

#we want to see whether everything has 6 taxonomic level (kingdom to genus)
getDepth ← function(taxonString){
initial ← nchar(taxonString)
removed ← nchar(gsub(“;”, “”, taxonString))
return(initial-removed)
}

depth ← getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria ← grepl(“Bacteria;”, tax.write$taxout)
archaea ← grepl(“Archaea;”, tax.write$taxout)
eukarya ← grepl(“Eukaryota;”, tax.write$taxout)

tax.write[depth > 6 & bacteria,] #if zero, we’re good to go
tax.write[depth > 6 & archaea,] #if zero, we’re good to go
tax.write[depth > 6 & eukarya,] #if zero, we’re good to go

write.table(tax.write[,c(“taxid”,“taxout”)],file=“silva.full_v138.tax”,sep=“\t”,row.names=F,quote=F,col.names=F)

At the end, I get a taxonomy file that has NAs in it:

KT989341.GycFall2 NA
CP016896.FYOASo94 NA
CP009575.GJRIva57 NA

I have no idea what I am doing wrong, and I would appreciate any help! Thanks!