Using silva as reference database in MiSeq SOP

Hi,

I’m planning on using the silva database for my metagenomic analysis and was hoping to get a few things clarified -

  1. Which version of the silva database was used to compile the silva.bacteria.fasta currently available in the MiSeq SOP? Also, apart from this, where can I get the .fasta files for the database?

  2. The silva v132 full length database downloaded from https://mothur.org/wiki/silva_reference_files/ has only a .align file and a .tax file. Can I use the .align file instead of the fasta file when customizing the database with pcr.seqs, and then use that output in the “reference=” parameter instead of the fasta file as instructed in the MiSeq SOP?

  3. I would like someone to verify if my understanding is correct - the full length database available at the above link is basically the full silva database aligned using a few sequences that are available in the SEED database. What is the purpose of doing so? And if I were to use these 2 databases for aligning a query sequence, how would I go about it?

Thanks in advance for all the help.

EDIT: Additionally, I want to know why, in the MiSeq SOP, the alignment is done with Silva while the Bayesian classification is done with RDP. Is there any disadvantage to using Silva itself?

Just the easy replies:

The .align file is the fasta file, just aligned. I am not sure which fasta you are asking about. You can then run pcr.seqs on the .align file; I think this is explained elsewhere.

Not sure about the alignment procedure, but having the alignment is the best way to then compare all sequences to each other… And to know where the limits of your sequences are, and whether they are missing the beginning or the end… And many more things. The classifier uses RDP, which is a method, against the SILVA database. Again, I am not sure what you are asking here.

Thanks for your reply. The MiSeq SOP mentions the use of the silva database as a .fasta file, which is customized using pcr.seqs and then used to align the query sequences. But the reference files available for download had only .align files, so I was just wondering if the .align and .fasta files are equivalent and whether one can be used in place of the other.
My other query was again about the MiSeq SOP: the alignment step uses the silva reference sequences, but the classify.seqs step uses the RDP files. The silva database contains its own .tax file, so why are the RDP tax and sequence files preferred over silva’s, especially considering that silva was used in the previous step?

I am not sure about the SOP - sorry. I will leave it to the real MOTHUR people to answer that question : )

But, yes, the align file is a fasta file.
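
To illustrate the point, here is a minimal Python sketch (not part of mothur; the function name `degap` and the sample records are made up for illustration). It strips the `.` and `-` alignment characters from aligned FASTA text, which leaves the plain, unaligned sequences. mothur has its own degap.seqs command for this job; the sketch is only meant to show that the two file types hold the same records.

```python
def degap(aligned_fasta_text):
    """Return FASTA text with '.' and '-' removed from the sequence lines.

    Illustrative helper only, not a mothur function.
    """
    out = []
    for line in aligned_fasta_text.splitlines():
        if line.startswith(">"):
            # Header lines are kept unchanged.
            out.append(line)
        else:
            # Sequence lines lose their alignment padding and gap characters.
            out.append(line.replace(".", "").replace("-", ""))
    return "\n".join(out)


# Made-up aligned records: same length, padded with '.' and '-'.
example = ">seq1\n..AC-GT..\n>seq2\n.A-C-GT-.\n"
print(degap(example))
```

Going the other way (unaligned to aligned) is what align.seqs does, so it cannot be done with a simple text substitution like this.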

For alignment (i.e. align.seqs), there’s no suitable alternative to using the silva reference alignments to align your sequences. greengenes’s and RDP’s alignments are horrible. For classification (i.e. classify.seqs) you can use whatever reference you want. Some prefer silva or greengenes to RDP because they are larger and have more information for as yet uncultured taxa. Some prefer the RDP because its taxonomy is based on the authoritative Bergey’s taxonomy. Some prefer silva over greengenes because silva is still getting updated whereas greengenes is not.

As was mentioned, *.align files are fasta files. You’ll know something is a fasta file if the first line for each sequence starts with a > character. You’ll know something is aligned if you see . and - characters in the sequence data and if the sequences are all the same length.
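
If you want to check a file yourself, the two rules of thumb above can be written as a small Python sketch. This is not mothur code: the function names `read_fasta` and `looks_aligned` and the sample records are invented for illustration, and the check is only a heuristic.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples.

    Raises ValueError if sequence data appears before any '>' header.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is None:
            raise ValueError("sequence data before first '>' header; not FASTA")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def looks_aligned(records):
    """Heuristic: all sequences have equal length and contain '.' or '-'."""
    if not records:
        return False
    lengths = {len(seq) for _, seq in records}
    has_gaps = any("." in seq or "-" in seq for _, seq in records)
    return len(lengths) == 1 and has_gaps


# Made-up examples: the first is aligned, the second is not.
aligned = read_fasta(">a\n..AC-GT..\n>b\n.A-C-GT-.\n")
unaligned = read_fasta(">a\nACGT\n>b\nACGTAC\n")
print(looks_aligned(aligned), looks_aligned(unaligned))
```

For a real reference file you would read the text from disk first, e.g. `read_fasta(open(path).read())`, where `path` points at your own .align or .fasta file.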

Pat

Thanks so much. That clears up my doubts.

I can relate to the point made by arvalve. It seems odd that the output of align.seqs is .align and not .align.fasta. I remember being confused when I started using mothur. When you enter the aligned file into filter.seqs, filter asks for a fasta file, yet your input ends in .align. Perhaps a lack of consistency that might be worth fixing?

Giovanni Widmer