Custom taxonomic database for CO1 (or another non-rRNA gene)

Hi,
I imagine this has been asked before, but I am having a hard time finding it on the forum. Suppose I wanted to create a custom taxonomic database (for assignment, alignment, etc. in mothur) with data and functionality similar to the Silva database, release 128 (i.e. with complete taxonomy information, as available, from genus to phylum, formatted appropriately for mothur). Is there a set of mothur commands to accomplish that?

I found the pcr.seqs command, but it looks like it simply reduces the alignment to the region of interest in the 16S (or whichever) gene.

Have users developed alternative custom reference databases from NCBI GenBank (say, for the CO1 mitochondrial gene in metazoans, or the ITS region)? Getting a set of reference FASTA files is easy. Generating the complete taxonomy is less so, but doable with the NCBI taxonomy dump. Formatting the FASTA sequences with the complete taxonomy is the tricky part, and I’m not sure what the format of that database should be.

Any help appreciated.

LP

There are no commands in mothur for doing this, but as you say, any genetic resource you find should come with approximately mothur-compatible taxonomy strings. If you have a specific gene of interest, you might find that people have experience getting that data into mothur. For example, I’ve done this with COI sequences, where I just got the data from the BOLD database. BOLD has the sequences, and also taxonomy identifiers that can be massaged into the format mothur requires.

If you’re interested in ITS data, the UNITE project also releases mothur-compatible databases for identification.

Thanks dwaite. It is actually CO1 that I am using, and I have tried to download and work with the latest release of the iBOL database (6.50) to create such a database. It seems I actually need to download ALL releases, not just the latest?

Anyway, could you provide some guidelines, or the code you used, to format BOLD/iBOL CO1 sequences into a mothur-compatible taxonomy file? And perhaps an example header from the resulting database file?

No problem. For my project I was interested in the Hemiptera, so I only downloaded that part of the database. I just used the Barcode Index Number page and search tools to narrow down to my organisms of interest. Once I had those, I downloaded the sequences (Sequences -> Fasta) and the metadata (Specimen Data -> TSV).

The taxonomic information is in columns 9, 11, 13, 15, 17, 19, and 21 of the TSV file, so I just extracted those and pushed them into a mothur taxonomy format using some basic Linux commands:

# Get the sequence names
cut -f1 bold_data.txt | tail -n+2 > seqNames.txt

# Get the taxonomy strings
cut -f9,11,13,15,17,19,21 bold_data.txt | tail -n+2 | sed 's/\t/;/g' > tax.txt

# Stick them together
paste seqNames.txt tax.txt > taxonomy.txt

There are a few things you then need to fix up with this.

  1. There are often hyphens in sequence names. These need to be changed in both the fasta and taxonomy files, because mothur can have issues with them
  2. There are some spaces and empty fields in the taxonomy strings. The spaces need to be replaced and the empty fields removed, in whatever way your question requires
  3. Your taxonomy file needs a trailing ‘;’ on each line. I just did this by opening the file in Geany (or Notepad++, whatever you have) and replacing ‘\n’ with ‘;\n’

Hope that helps, but I’ll stress that for problem 2 there is no universal fix: you need to decide how to deal with the spaces and empty fields depending on how they affect your data.
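
For readers who would rather script the three fix-ups than do them by hand, here is a minimal sketch in a single awk pass. It is not a definitive recipe. It assumes the input is the tab-separated taxonomy.txt built above, that turning hyphens and spaces into underscores is acceptable, and that empty ranks can simply be dropped (which, per the caveat above, may not suit every dataset). The function name fix_taxonomy is hypothetical.

```shell
# fix_taxonomy (hypothetical helper): reads "name<TAB>rank;rank;..." lines
# on stdin and writes the cleaned-up lines to stdout.
fix_taxonomy() {
    awk -F'\t' '{
        name = $1
        gsub(/-/, "_", name)           # 1. hyphens in names -> underscores
        n = split($2, ranks, ";")
        tax = ""
        for (i = 1; i <= n; i++) {
            r = ranks[i]
            gsub(/^ +| +$/, "", r)     # trim leading and trailing spaces
            if (r == "") continue      # 2. drop empty ranks (an assumption)
            gsub(/ /, "_", r)          # 2. spaces inside a rank -> underscores
            tax = tax r ";"            # 3. every rank, including the last, ends in ";"
        }
        print name "\t" tax
    }'
}
```

It could be run as fix_taxonomy < taxonomy.txt > taxonomy.fixed.txt. The fasta headers then need the same hyphen replacement so the names still match; something like sed '/^>/ s/-/_/g' on the fasta file should do it, since it only touches header lines and leaves alignment gaps alone.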

Thanks dwaite.
I tried some testing with just one class (Maxillopoda) downloaded from BOLD and formatted similarly to how you suggested. I had to add a “;” to all lines of the tax file that did not have one (otherwise mothur flagged an error), and I had to make the ids in the taxonomy file and the fasta database line up. But even after using some perl to keep only lines with common ids, I still have an issue: during classify.seqs, mothur can’t seem to find the ids from the fasta in the taxonomy file.

classify.seqs(fasta= , template= , taxonomy= )

'GBCX4624-15' is in your template file and is not in your taxonomy file. Please correct.
'SLAVA126-11' is in your template file and is not in your taxonomy file. Please correct.
'BIPOL456-10' is in your template file and is not in your taxonomy file. Please correct.
'GBA11172-13' is in your template file and is not in your taxonomy file. Please correct.
DONE.
It took 3 seconds get probabilities.

Note: the file names in classify.seqs are left out for simplicity. They do exist, of course, and there are snippets of them below.

Anyway, when I run this, it completes, but I only get the tax.sum file and no .taxonomy file. Why would I get only the tax.sum file?

FYI, it looks like this:

~/Downloads# head Final.fix.tree.sum 
#1.39.5
2750
7
0 3
Root
2 Arthropoda
1208 unclassified
1 unknown

1 1

When I grep those seq names, they are indeed in the taxonomy file, so that is puzzling.

~/Downloads# grep -n GBCX4624-15 Final.fix.tax
18493:GBCX4624-15 Arthropoda;Maxillopoda;Sessilia;Archaeobalanidae;Acastinae;Conopea;Conopea sp.

Here is what my final tax/template files look like:

~/Downloads# head Final.fix.tax
GBA1955-07 Arthropoda;Maxillopoda;Sessilia;Balanidae; ;Balanus;Balanus glandula;
GBA1983-07 Arthropoda;Maxillopoda;Sessilia;Balanidae; ;Balanus;Balanus glandula;
GBA2013-07 Arthropoda;Maxillopoda;Sessilia;Balanidae; ;Balanus;Balanus glandula;
GBA4315-09 Arthropoda;Maxillopoda;Sessilia;Balanidae; ;Balanus;Balanus glandula;
GBA4369-09 Arthropoda;Maxillopoda;Kentrogonida;Sacculinidae; ;Heterosaccus;Heterosaccus californicus;
GBCX0185-06 Arthropoda;Maxillopoda;Calanoida;Metridinidae; ;Metridia;Metridia gerlachei;
GBCX0186-06 Arthropoda;Maxillopoda;Calanoida;Pseudodiaptomidae; ;Pseudodiaptomus;Pseudodiaptomus nihonkaiensis;
GBCX0194-06 Arthropoda;Maxillopoda;Calanoida;Temoridae; ;Eurytemora;Eurytemora pacifica;
GBCX0195-06 Arthropoda;Maxillopoda;Calanoida;Pontellidae; ;Labidocera;Labidocera rotunda;
GBCX0196-06 Arthropoda;Maxillopoda;Calanoida;Tortanidae; ;Tortanus;Tortanus dextrilobatus;
~/Downloads# head Final.fasta
>GBA1955-07
------------------------------------------------------------
------------------------------------CTTATTCGGGCTGAACTTGGTCAA
CCAGGTAGACTGATTGGAGAT---GATCAGATTTACAATGTAATTGTTACTGCTCATGCT
TTTATTATGATTTTTTTCATAGTTATACCTATTATAATTGGGGGTTTTGGTAATTGATTA
CTTCCATTAATATTAGGAGCTCCTGATATAGCTTTTCCACGTCTTAATAATATAAGTTTT
TGGCTATTACCCCCAGCTTTAATATTGTTGATTAGAGGATCATTAGTAGAAGCTGGAGCT
GGTACTGGATGGACAGTTTACCCTCCTTTATCGAGAAATATTGCCCATTCAGGAGCATCG
GTAGATTTATCTATTTTTTCTCTCCATTTAGCTGGAGCTTCATCTATTCTTGGGGCCATT
AATTTTATATCGACAGTTATTAAT------------------------------------

Now, I know there are some tax lines that end strangely, with numbers, and lack the “;” at the end of the line, but it appears they are ‘ignored’ by mothur. Anyway, I’m not sure where to go from here. Are these tax/template files close? Or why might classify.seqs be failing?

Hm, could you try removing the spaces? I don’t know if they cause problems, but all the entries in the typical 16S databases avoid them.
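
One way to check this before rerunning is to compare the ids in the two files and look for taxonomy lines that contain extra whitespace. The sketch below is only a diagnostic under assumptions: it treats the first whitespace-separated field of each taxonomy line as the id, and the first word after ‘>’ as the fasta id. Both function names are hypothetical.

```shell
# ids_missing_from_tax FASTA TAX (hypothetical helper): prints fasta ids that
# never appear as the first whitespace-separated field of the taxonomy file.
ids_missing_from_tax() {
    awk 'NR == FNR { seen[$1] = 1; next }   # first file read: taxonomy ids
         /^>/ {                             # second file read: fasta headers
             id = substr($1, 2)             # strip the leading ">"
             if (!(id in seen)) print id
         }' "$2" "$1"
}

# lines_with_extra_whitespace TAX (hypothetical helper): prints taxonomy lines
# with more than two whitespace-separated fields, i.e. lines whose taxonomy
# string itself contains spaces.
lines_with_extra_whitespace() {
    awk 'NF > 2' "$1"
}
```

With the files from this thread, that would be ids_missing_from_tax Final.fasta Final.fix.tax and lines_with_extra_whitespace Final.fix.tax. If the second helper prints anything, those records have spaces inside the taxonomy string and are worth cleaning up before trying classify.seqs again.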

dwaite, I was just about to ask if anyone had adapted the BOLD database for mothur. Thanks for your detailed instructions!