I’m going through and updating my Silva and RDP files to the latest version and adding in the extra sequences I like to have in the databases. When I went to paste the extra alignments into the silva.bacteria.fasta file, I found a mixture of ---- and .... placeholders. This is probably a silly question, but what’s the difference, and how should I format the new alignments I want to add?
So the sequences in silva.bacteria.fasta are all aligned, and the alignment is 50,000 characters long even though the 16S gene is only about 1,500 letters long. By some forgotten standard, .'s indicate missing data and -'s indicate a gap within an alignment. So in silva.bacteria.fasta the .'s come at the beginning and end of each aligned sequence (recall the use of trump=. in filter.seqs), and the -'s come within the middle of a sequence to make sure that all the columns align in an evolutionarily consistent manner (i.e. positional homology). So if you add a sequence to silva.bacteria.fasta, it needs to be aligned to the same alignment. The easy way to get that is to download whatever extra reference sequences you want from the Silva website with the alignment gaps included.
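To make the . versus - distinction concrete, here is a minimal Python sketch that reports how much of an aligned sequence is terminal . padding and how much is internal - gaps, and flags sequences whose length does not match the reference alignment. The function names (read_fasta, describe_alignment, mismatched_lengths) are made up for this illustration and are not part of mothur.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def describe_alignment(seq):
    """Count terminal '.' padding, internal '-' gaps, and bases in one aligned sequence."""
    leading = len(seq) - len(seq.lstrip("."))
    trailing = len(seq) - len(seq.rstrip("."))
    core = seq[leading:len(seq) - trailing]
    return {
        "length": len(seq),
        "leading_missing": leading,
        "trailing_missing": trailing,
        "internal_gaps": core.count("-"),
        "bases": sum(ch.isalpha() for ch in core),
    }


def mismatched_lengths(records, expected_length):
    """Return the headers of sequences whose aligned length differs from expected_length."""
    return [header for header, seq in records if len(seq) != expected_length]


# Hypothetical usage: every sequence should report the same length
# (50,000 for the alignment discussed above).
# with open("silva.bacteria.fasta") as handle:
#     records = list(read_fasta(handle))
# print(mismatched_lengths(records, 50000))
```

If mismatched_lengths returns anything for a file you are about to append, those sequences are not in the same alignment and should not be pasted in as they are.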
Thanks for the help, Pat! The process has changed a bit since I last updated my files.
For anyone else who stumbles across this: go to the Silva browser (http://www.arb-silva.de/browser/) and find your sequence. Click the little green cart next to the name of your bacterium to add it to your cart, and repeat for each sequence you want to add. When you finish, click Download under Cart in the upper right-hand corner of the screen, choose “Fasta with gaps” and whatever file compression you wish, then click “start export” to download. It looks like you’ll need to load the file into a plain text editor to remove line breaks, white space, and excessive descriptions, and to search and replace the U’s in the sequences back to T’s. Then copy and paste the result onto the end of the original file.
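If you would rather not do the cleanup by hand, the editing steps above can be sketched in Python. This is only a sketch under a few assumptions: the export is an ordinary FASTA file, "excessive descriptions" means everything after the first space in the name line, and the file name silva_export.fasta is a hypothetical placeholder. The functions clean_silva_export and format_fasta are made up for this example.

```python
def clean_silva_export(lines):
    """Return (name, sequence) pairs with unwrapped sequences and U converted to T."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: keep only the first whitespace-delimited token of the name line.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            # Remove any white space inside the line, then change RNA U's to DNA T's.
            # Only sequence lines are touched, so names containing a U are left alone.
            seq = "".join(line.split())
            chunks.append(seq.replace("U", "T").replace("u", "t"))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def format_fasta(records):
    """Write each record as a name line followed by a single unwrapped sequence line."""
    return "".join(">{}\n{}\n".format(name, seq) for name, seq in records)


# Hypothetical usage: append the cleaned sequences to a copy of the reference file.
# with open("silva_export.fasta") as handle:
#     cleaned = clean_silva_export(handle)
# with open("silva.bacteria.fasta", "a") as out:
#     out.write(format_fasta(cleaned))
```

It is probably worth working on a copy of silva.bacteria.fasta and checking that the new sequences have the same aligned length as the old ones before you replace the original.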