Tweaking databases to include custom sequences

This isn’t exactly a command issue, but this was the closest forum I could find.

I’m processing 454 16S bacterial data following the Costello example. I don’t know a lot about the total communities to expect from these samples, but there are a couple of dominant species that I expect to show up every time, and several rarer ones that I know might appear and that I’d like to be sure to recognize when they do occur. Unfortunately, few of these sequences seem to show up in the silva database files available for download. The wiki says to just create your own databases, but while the “how to” bit may be obvious to some, it is less so to me.

For the silva.bacteria.fasta alignment database, I took my reference sequences and uploaded them to the silva alignment website (the SINA aligner), got the fasta alignment files, and pasted them onto the end of the silva.bacteria.fasta file. Is this correct?

How many of the other files need to be modified to accomplish my goal? The secondary structure (silva.ss.map) and chimera slayer (silva.gold.align) files look like they could be left alone. The taxonomy file for OTU characterization looks like it needs to be altered, but I wasn’t sure how. Can I just add in the accession numbers and appropriate taxons, or is there another step? And I’m not certain about the nogap.bacteria.fasta file either …

Thanks in advance!

The taxonomy file for OTU characterization looks like it needs to be altered, but I wasn’t sure how. Can I just add in the accession numbers and appropriate taxons, or is there another step? And I’m not certain about the nogap.bacteria.fasta file either…

Right - the first column is the accession number and the second column is the taxonomy information. In the second column the taxa are separated by semicolons (‘;’); the taxonomy string cannot contain spaces and must end in a semicolon.
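For anyone who wants to check their added lines against the format described above, here is a minimal Python sketch. It is not part of mothur; the function name `check_taxonomy_line` is made up for illustration, and it assumes the name and the taxonomy are separated by a single tab, as in the files discussed in this thread.

```python
def check_taxonomy_line(line):
    """Return a list of problems found in one taxonomy line (empty if none).

    Assumed format: <name><TAB><taxon>;<taxon>;...;  with no spaces
    in the taxonomy and a trailing semicolon.
    """
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 2:
        return ["expected exactly one tab between the name and the taxonomy"]

    name, taxonomy = parts
    problems = []
    if not name:
        problems.append("empty sequence name")
    if " " in taxonomy:
        problems.append("taxonomy contains a space")
    if taxonomy.endswith(";"):
        taxa = taxonomy[:-1].split(";")
    else:
        problems.append("taxonomy does not end in a semicolon")
        taxa = taxonomy.split(";")
    if any(taxon == "" for taxon in taxa):
        problems.append("empty taxon between semicolons")
    return problems


# A well-formed line gives an empty list; a line that uses a space
# instead of a tab is reported.
print(check_taxonomy_line("HM112050.1\tBacteria;Proteobacteria;Gammaproteobacteria;Gamma-1;"))
print(check_taxonomy_line("HM112050.1 Bacteria;Proteobacteria;"))
```

Running it over every line of the merged taxonomy file before classifying should catch the most common formatting slips early.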

As for the nogap.bacteria.fasta file - you can take your new silva.bacteria.fasta file and remove all of the .'s and -'s, and that becomes your new nogap.bacteria.fasta file.
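The gap removal can be sketched in a few lines of Python. This is only an illustration, not a mothur feature: `strip_gaps` is a hypothetical helper, and it assumes standard FASTA headers that start with ">". Header lines are left untouched, since accession numbers such as HM112050.1 contain a "." themselves.

```python
def strip_gaps(aligned_fasta_text):
    """Remove '.' and '-' from sequence lines of an aligned FASTA string.

    Header lines (assumed to start with '>') are copied unchanged.
    Sequence lines that consist only of gap characters are dropped.
    """
    out = []
    for line in aligned_fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            stripped = line.replace(".", "").replace("-", "")
            if stripped:
                out.append(stripped)
    return "\n".join(out) + "\n"


aligned = ">HM112050.1\n..--G-A-U\n----\n-U..\n"
print(strip_gaps(aligned), end="")
```

To use it on real files, read silva.bacteria.fasta into a string, pass it through the function, and write the result to nogap.bacteria.fasta.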

Hope this helps…
Pat

Sounds straightforward - thanks!

Hello Mothur forum!
I am also attempting to generate a custom database for Mothur but, unfortunately, I’m running up against a wall. I suspect that the classifier doesn’t like my alignment file (the results are always “unknown(100)” - not a single OTU classified). I followed the instructions posted above by concatenating my sequences (aligned to SSURef_108_SILVA_NR_99_11_10_11_opt_v2.arb using the SINA aligner) to the end of the silva.bacteria.fasta file (using cat). The same file was used to generate the nogap.bacteria.fasta equivalent (using Perl to cut out the ‘.’ and ‘-’ chars). I then also created a tab delimited taxonomy file to add to the end of the silva.bacteria.rdp6.tax file. If anyone has successfully created their own custom database and can provide feedback on our process, I’d appreciate it!
Thanks,
Irene

Irene,

You said…

I then also created a tab delimited taxonomy file to add to the end of the silva.bacteria.rdp6.tax file.

Do you mean you put tabs between the different taxa or between the sequence name and the taxonomy? The different taxa should be separated by semicolons - “;”, not tabs. Can you post the first few lines of the taxonomy and fasta files you are using?

Also, are you using the most recent version of mothur? If not (you should be!), are the sequences all pointed in the same direction? Have you tried classifying one of the reference sequences with your taxonomy?

Pat

Wow, what service! I didn’t expect such a prompt reply!
I am following the output of the original .tax file where the first column is the sequence identifier (or accession num), followed by a tab, followed by the semi-colon delimited taxonomic classification. Here is what it looks like:

HM112050.1 Bacteria;Proteobacteria;Gammaproteobacteria;Gamma-1;
HM112130.1 Bacteria;Proteobacteria;Gammaproteobacteria;Gamma-1;

My fasta file (the part I’m adding on to silva.bacteria.fasta) looks like this:

HM112050.1
-------------------------------------------------------------G-A-U-U--
G-A--A-C-G-C--U-G-G-C--G-G--C-A-G-G----------C----U-U--AA-C--AC-A----U
-G--C----A-A-G--U-C--GA-A-CG-----------G-UAA--C-A--U------------------
----------------G-A-G-U-G---------------------------------------------
----------------------------------------------------------CUU-G-------
---------------------------------------------C--A--C-U-U--------------
---------G-A-U----G--AC--G-------AG-U-G-G-C-GG-A--C-------------G-G---
--------------------------------------------------------------------G-
-U--G-A--G-U-A---AG-G-U-A--U-G-GG---G-A--U-CU-G--C-C-GAA--UG-G--------
--------------------------------------------------------------------A-
G--A----G--G--G---AC-AA-C--AG-------------------------U-U-G-----------
----------------------G--A-A-A-----------------------CGA-C-U-G-CU--A-A
-UA---C-C-U--C-A--U-A----------A--------------------------------------
--------------------------------A---------------

I am using the most recent version of mothur; thinking that an old version might be to blame for my issue, I went ahead and re-downloaded it. I believe that these fasta sequences are in the correct and identical orientation (I aligned them and double-checked the alignment). Is it possible that I need to align the entire silva.bacteria.fasta file with the silva SSU database? Thanks for your help!

As a continuation of the discussion: I attempted to re-align the entire silva.bacteria.fasta file with the SINA aligner along with my custom sequences - same result. I am unable to use this file to classify my sequences – even sequences contained within the alignment. Any ideas what could be going on? I’d appreciate any help – one of the great things about Mothur is the fact that you can create your own databases!

A few more things…

First, you don’t need an alignment for classify.seqs.

Second, did you try to re-classify the sequences in your training set and see if you get out what you put in?

Third, how long are the sequences you are trying to classify?

I see your point - I should try to narrow down at which point in the pipeline I’m getting a failure. I’m comparing the results from my custom database to those I get with the databases you kindly provide for Mothur. The alignment process seems to work well - I get the same alignment output for both databases. I then use screen.seqs on the sequences based on that alignment. For classify.seqs, I took out the ‘.’ and ‘-’ chars from the database so that I can use that as a template and use my custom taxonomy file in the same command. To answer your questions:

2) I am attempting to re-classify the sequences in my training set and am not able to get out what I put in – it cannot classify them at all and lists them as ‘unknown(100)’.
3) The sequences I am trying to classify (ultimately) are between 300 and 400 bp but the ones in my training set are between 1300 and 1500.

Thanks again for helping!

Can you send us a copy of the fasta and tax file and one of your sequences so we can take a look? - mothur.bugs@gmail.com

Dear Pat - thanks for all of your help - I figured out the problem. As usual, it came down to an issue with the characters found in the files (newlines and carriage returns). When I modified the files to contain no newline/return characters (except for the newline after the fasta tagline), all was well. Thanks again - I look forward to using my new custom database!
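The post above does not say exactly how the files were cleaned up, so the following is just one possible way to do it in Python. The helper name `unwrap_fasta` is made up for illustration, and it assumes headers start with ">". It converts Windows and old Mac line endings to "\n", drops blank lines, and joins each wrapped sequence onto a single line after its header.

```python
def unwrap_fasta(text):
    """Normalize line endings and put each sequence on a single line.

    Assumes FASTA headers start with '>'. Blank lines are dropped, and
    any sequence text that appears before the first header is ignored.
    """
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")

    records = []
    name, chunks = None, []
    for line in normalized.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line, []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    return "".join(f"{header}\n{sequence}\n" for header, sequence in records)


wrapped = ">a\r\nAC-\r\n\r\n-GU\r\n>b\rGG\r"
print(unwrap_fasta(wrapped), end="")
```

When reading and writing the real files, opening them with `newline=""` for reading and `newline="\n"` for writing keeps Python from translating the line endings behind your back.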

Oh, good. Glad it’s working.

Dear Mothur helpers,
I am customizing the RDP training set following steps very similar to what has been discussed here. However, I am experiencing a different problem: although I only have six taxonomic levels associated with the sequences that I added to the RDP fasta and taxonomy files, Mothur is classifying my sequences with eight taxonomic levels; i.e., after the “genus” level, Mothur gives me two additional levels, as in the following example:

Bacteria(100);Fusobacteria(100);Fusobacteriia(100);Fusobacteriales(100);Fusobacteriaceae(100);Fusobacterium(100);Fusobacterium_unclassified(100);Fusobacterium_unclassified_unclassified(100);

I’m not sure what is happening - has anybody experienced a similar problem?

In my fasta file, sequences were labeled as:

Accession number
sequence

In the taxonomy file, I used the following format:
Accession number [tab space] Taxonomy;
The taxonomy was written following the same format as discussed earlier here, i.e. no spaces, everything separated by “;”, and another “;” at the end of each line. I checked several times and the sequences I added contain only six taxonomic levels.
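One way to double-check the level counts is to tally how many levels each line of the merged taxonomy file has. The Python sketch below is only an illustration (`taxonomy_depths` is a hypothetical helper and assumes a tab between the name and the taxonomy). It does not diagnose the problem, but it shows whether the added entries really have six levels and whether the entries from the original file go deeper than that.

```python
from collections import Counter


def taxonomy_depths(lines):
    """Count how many taxonomy lines have each number of levels.

    Assumed format per line: <name><TAB><taxon>;<taxon>;...;
    Returns a Counter mapping number of levels -> number of lines.
    """
    depths = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _name, _tab, taxonomy = line.partition("\t")
        levels = [taxon for taxon in taxonomy.split(";") if taxon]
        depths[len(levels)] += 1
    return depths


example = [
    "seq1\tBacteria;Fusobacteria;Fusobacteriia;Fusobacteriales;Fusobacteriaceae;Fusobacterium;",
    "seq2\tA;B;C;D;E;F;G;H;",
]
for depth, count in sorted(taxonomy_depths(example).items()):
    print(depth, "levels:", count, "line(s)")
```

With a real file, pass the open file handle directly, e.g. `taxonomy_depths(open("merged.tax"))`, where the file name here is a placeholder.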

I merged my files and the Mothur-formatted RDP files using merge.files command within Mothur.

The only difference I can see is that in the original mothur-formatted RDP fasta file each sequence label also contains taxonomic information, whereas the sequences that I added only contain the accession number in their labels. But based on the discussion here, that shouldn’t matter - or should it?
Any advice will be very appreciated - Thank you!

Hi everyone,

If you have any experience with customizing the mothur-formatted RDP files, please let me know. I am just wondering where else I should look for possible mistakes to fix.

Thank you,

Pedro

Can you post this as a new query? This posting is several years old.

Pat