Taxonomic database implementation


I tried classifying my sequences using classify.seqs but realized that many of them were unclassified at the species level.
I nBLASTed these sequences against the nr database at NCBI and got hits with 100% identity to specific species (mostly P. acnes).
I looked through the .taxonomy files supplied with mothur and found that there were no sequences for this species and many others.
Is there a way to supplement the database with new sequences?

Sure, you’ll have to modify the taxonomy file to add species names. The RDP only supplies taxonomies to the genus level.

Patrick, would you be so kind as to provide some pointers on how to modify this training set to include species-level classification?
Has anyone done this before? It would save a lot of duplicated work if one user made the modified files available :wink: .

Kind regards,

FM Kerckhof


Yes, I know that people have done this. Here’s what the RDP training set might have in the taxonomy file…

J01695_S001099426 Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae;Escherichia_Shigella;

Here's how you would change it...
J01695_S001099426 Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae;Escherichia_Shigella;Escherichia_coli;
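
If you have many sequences to change, editing by hand gets tedious. Here is a minimal Python sketch of the same edit, assuming you already have your own lookup from sequence ID to species name (the `species_by_id` dictionary and the `add_species` function are made-up names for illustration, not part of mothur). It assumes each taxonomy line is a sequence ID, then whitespace, then a semicolon-terminated taxonomy string. Lines without an entry in the lookup are left at genus level.

```python
def add_species(line, species_by_id):
    """Append a species rank to one taxonomy line if its ID is in the lookup."""
    # Assumed layout: <sequence ID><whitespace><rank;rank;...;>
    seq_id, taxonomy = line.rstrip("\n").split(None, 1)
    species = species_by_id.get(seq_id)
    if species is None:
        # No species known for this sequence: keep the line as it was.
        return f"{seq_id}\t{taxonomy}"
    if not taxonomy.endswith(";"):
        taxonomy += ";"
    return f"{seq_id}\t{taxonomy}{species};"


def add_species_to_lines(lines, species_by_id):
    """Apply add_species to every non-blank line of a taxonomy file."""
    return [add_species(line, species_by_id) for line in lines if line.strip()]


# Hypothetical lookup that you would build yourself, e.g. from curated records.
species_by_id = {"J01695_S001099426": "Escherichia_coli"}

example = ('J01695_S001099426\tBacteria;"Proteobacteria";Gammaproteobacteria;'
           '"Enterobacteriales";Enterobacteriaceae;Escherichia_Shigella;')
print(add_species(example, species_by_id))
```

To process a whole file you would read its lines, pass them through `add_species_to_lines`, and write the result to a new taxonomy file rather than overwriting the original.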

You would probably have to add sequences to both the taxonomy and fasta files so that there are multiple Escherichia_coli sequences, etc. FWIW, the Greengenes database does have species-level data included for some sequences.
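
Since the two files have to stay in sync, a small helper can append a new reference to both at once and check that the ID sets still match. This is only a sketch under a few assumptions: the fasta headers start with `>` followed by the sequence ID, the taxonomy file has the ID as its first whitespace-separated field, and the function names, ID, and sequence below are placeholders, not real records.

```python
import os


def read_fasta_ids(path):
    """Collect the sequence IDs (first word after '>') from a fasta file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def read_taxonomy_ids(path):
    """Collect the sequence IDs (first field) from a taxonomy file."""
    with open(path) as handle:
        return {line.split()[0] for line in handle if line.strip()}


def _ends_with_newline(path):
    """True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            return True
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"


def append_reference(fasta_path, taxonomy_path, seq_id, sequence, taxonomy):
    """Append one reference sequence to both files, refusing duplicate IDs."""
    if seq_id in read_fasta_ids(fasta_path) or seq_id in read_taxonomy_ids(taxonomy_path):
        raise ValueError(f"{seq_id} is already present")
    # Start on a fresh line even if the existing file lacks a final newline.
    fasta_prefix = "" if _ends_with_newline(fasta_path) else "\n"
    tax_prefix = "" if _ends_with_newline(taxonomy_path) else "\n"
    with open(fasta_path, "a") as fasta:
        fasta.write(f"{fasta_prefix}>{seq_id}\n{sequence}\n")
    with open(taxonomy_path, "a") as tax:
        tax.write(f"{tax_prefix}{seq_id}\t{taxonomy}\n")
```

After adding your sequences, `read_fasta_ids(fasta_path) == read_taxonomy_ids(taxonomy_path)` should hold. If it does not, some sequence is missing from one of the two files.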

Another way to classify your sequences to the species level is to extract the sequences from the genera of interest and then do a phylogenetic analysis that includes sequences of the type strains from those genera. This approach is more robust than automatic classification, especially when dealing with medically important bacteria like Streptococcus, where species belonging to the same genus may be very closely related according to their 16S rRNA sequences.
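
For the extraction step, mothur's get.lineage command is the usual tool, but the idea is simple enough to sketch in Python. The sketch below assumes a taxonomy file with the sequence ID in the first field and semicolon-separated ranks that may carry confidence values in parentheses, such as `Streptococcus(98)`. The function names are made up for illustration.

```python
import re


def ids_in_genus(taxonomy_lines, genus):
    """Return the IDs of sequences whose taxonomy contains the given genus."""
    wanted = set()
    for line in taxonomy_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        seq_id, taxonomy = parts
        for rank in taxonomy.strip().split(";"):
            # Drop a trailing confidence value like "(98)" and any quotes.
            name = re.sub(r"\(\d+\)$", "", rank.strip()).strip('"')
            if name == genus:
                wanted.add(seq_id)
                break
    return wanted


def filter_fasta(fasta_lines, wanted):
    """Keep only the fasta records whose ID (first word after '>') is wanted."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            kept.append(line.rstrip("\n"))
    return kept
```

The selected records can then be written to a new fasta file, combined with type-strain sequences that you obtain separately, and passed to whatever alignment and tree-building programs you prefer.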