Silva custom database

Hi,

I have another question concerning a customized (updated) version of the SILVA database.

After reading a few other posts in this forum, I created a taxonomy file and a nogap.fasta file by parsing SSURef_111_tax_silva_trunc.fasta and splitting each record into its taxonomy and its ungapped sequence, separately for Eukaryotes, Archaea and Bacteria. These files are for the classify.seqs command.
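
A minimal Python sketch of this parsing step, not the exact script I used. It assumes each FASTA header has the form ">ID taxonomy;path", with the ID ending at the first space, and that the taxonomy file should hold one "ID, tab, semicolon-terminated taxonomy" line per sequence. The function names and output file names are made up for illustration.

```python
# Characters removed from sequence lines to obtain an ungapped sequence.
GAPS = str.maketrans("", "", ".- ")

def split_silva_fasta(lines):
    """Split SILVA-style FASTA lines into taxonomy rows and ungapped sequences.

    Assumes headers look like '>ID taxonomy;path' (ID ends at the first space).
    Returns (taxonomy, sequences), both lists of (id, string) tuples.
    """
    taxonomy, sequences = [], []
    current_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                sequences.append((current_id, "".join(chunks)))
            current_id, _, tax = line[1:].partition(" ")
            tax = tax.strip()
            if tax and not tax.endswith(";"):
                tax += ";"  # terminate the taxonomy string with a semicolon
            taxonomy.append((current_id, tax))
            chunks = []
        else:
            chunks.append(line.translate(GAPS))
    if current_id is not None:
        sequences.append((current_id, "".join(chunks)))
    return taxonomy, sequences

def ids_in_domain(taxonomy, domain):
    """Return the set of IDs whose taxonomy starts with the given domain."""
    return {seq_id for seq_id, tax in taxonomy if tax.startswith(domain + ";")}

def write_outputs(taxonomy, sequences, keep_ids, tax_path, fasta_path):
    """Write the taxonomy and ungapped FASTA files for the selected IDs."""
    with open(tax_path, "w") as tax_out:
        for seq_id, tax in taxonomy:
            if seq_id in keep_ids:
                tax_out.write(f"{seq_id}\t{tax}\n")
    with open(fasta_path, "w") as fasta_out:
        for seq_id, seq in sequences:
            if seq_id in keep_ids:
                fasta_out.write(f">{seq_id}\n{seq}\n")
```

Usage would be roughly: open the SILVA file, call split_silva_fasta on the file object, then call ids_in_domain and write_outputs once per domain ("Eukaryota", "Archaea", "Bacteria").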

Another thing that, I guess, should improve results is an updated alignment. For this I took SSURef_111_tax_silva_full_align_trunc.fasta and reformatted it to remove spaces and line breaks within the sequences. This file is for the align.seqs command.
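
The reformatting can be sketched in Python like this. It keeps the gap characters ("." and "-"), removes only whitespace inside the sequences, and leaves the header lines unchanged. The width check at the end is my own addition, on the assumption that every sequence in a reference alignment should have the same number of columns.

```python
def unwrap_alignment(lines):
    """Yield (header, aligned_sequence) with each sequence joined onto one line.

    Whitespace inside sequence lines is removed; gap characters are kept.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # split() with no arguments drops every run of whitespace
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)

def alignment_widths(records):
    """Return the set of distinct sequence lengths; one value means a consistent alignment."""
    return {len(seq) for _, seq in records}
```

If alignment_widths returns more than one value, something probably went wrong during the reformatting.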

Now, as a test, I wanted to reproduce the silva102 files for Eukaryotes provided by mothur. For this I did the same thing as above with the SSURef_102_… files downloaded from the SILVA archive. I expected that the eukaryota.SSURef_102_SILVA_NR_99.tax file (produced by me) and the silva.eukarya.silva.tax file would be identical. However, my tax file has 31,809 entries, while the silva.eukarya.silva.tax file has only 1,238 entries. For example, the entry AB026819.394.2191 is present in my file (which is based on SSURef_102_SILVA_NR_99) but missing from the mothur file. Now I’m a bit confused. How is this possible?
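
For reference, a comparison like this can be sketched in Python as follows, assuming both taxonomy files have the sequence ID as the first whitespace-delimited field on each line. The function names are made up.

```python
def read_tax_ids(lines):
    """Collect the sequence IDs (first whitespace-delimited field) from taxonomy lines."""
    return {line.split(None, 1)[0] for line in lines if line.strip()}

def compare_ids(mine, reference):
    """Return the IDs found only in my file, only in the reference file, and in both."""
    return {
        "only_mine": mine - reference,
        "only_reference": reference - mine,
        "shared": mine & reference,
    }
```

Reading both .tax files with read_tax_ids and passing the two sets to compare_ids shows which entries, such as AB026819.394.2191, are in one file but not the other.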

Thanks in advance…

Ok, reading documentation can always be a good idea. I guess this is the explanation for the reduction in size (from http://www.mothur.org/wiki/Silva_reference_files):

“The actual reference alignment that SILVA uses with their SINA aligner is called the SEED alignment. We don’t know what this actually is. We have tried to duplicate it by identifying the unique sequences in the SSURef database (v102) that have a 100% quality score to the SEED alignment and that go from the end of the traditional 8f/27f primer to the beginning of the traditional 1492r primer.”

The quote contains two restrictions (a 100% quality score to the SEED alignment, and coverage from the end of the traditional 8f/27f primer to the beginning of the traditional 1492r primer), which presumably explain this reduction in size.
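
The second restriction (coverage of the region between the primers) could be checked on aligned sequences roughly as in the Python sketch below. The alignment columns that correspond to the primer positions are not given in the quote, so start_col and end_col are hypothetical parameters that would have to be determined for the alignment in use. The quality-score restriction is not covered here.

```python
def covers_region(aligned_seq, start_col, end_col, gap_chars=".-"):
    """Check whether an aligned sequence spans the columns start_col..end_col.

    Columns are 1-based. start_col and end_col are hypothetical values standing
    in for the alignment positions of the primer boundaries. Returns True if the
    first base is at or before start_col and the last base is at or after end_col.
    """
    first = next(
        (i for i, c in enumerate(aligned_seq, 1) if c not in gap_chars), None
    )
    if first is None:
        return False  # the sequence contains only gap characters
    trailing_gaps = next(
        i for i, c in enumerate(reversed(aligned_seq)) if c not in gap_chars
    )
    last = len(aligned_seq) - trailing_gaps
    return first <= start_col and last >= end_col
```

Sequences for which covers_region returns False would be dropped from the custom reference set.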

Should I proceed the same way to create my custom database?

By the way, a huge thank-you to Pat Schloss for mothur, which is really a beautiful data analysis tool.