I’ve exported a Silva v128 database from ARB that includes the taxonomy field ‘tax_slv’ as well as the ‘full_name’ field for the 190,661 sequences. I did this so that I can (in most cases) get the species or strain name for each sequence, instead of just a taxonomy truncated to genus. I recognize that in some cases ‘full_name’ will be misleading (e.g., it will refer to a eukaryote that was targeted for genome sequencing when the reported sequence is bacterial), but I think I can deal with this.
In the README here: http://blog.mothur.org/2017/03/22/SILVA-v128-reference-files/ there is a step in which you run R code (provided by Eric Collins) to map the taxa to 6 Linnaean levels, a process that starts with reading in the Silva mapping file ‘tax_slv_ssu_128.txt’. This may seem obvious, but when I run the R code on the modified taxonomy file (including ‘full_name’), the process fails and no taxonomy string is returned. The code looks for an exact match between the text in tax_slv_ssu_128.txt and the taxonomy string of each of the 190,661 sequences, and appending ‘full_name’ to the taxonomy breaks that match. I’ve checked, and the folks at ARB/Silva do not have a version of tax_slv_ssu_128.txt that includes species or strain.
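To make the mismatch concrete, here is a minimal Python sketch (not the R code from the README) of what I think is happening and of the kind of fix I have in mind. It assumes that ‘full_name’ was appended as the last semicolon-separated field of the taxonomy string, that ‘full_name’ itself contains no semicolons, and that the mapping file is tab-separated with the taxonomy path in the first column and the rank in the third. The function names, the example lineage, and the numeric IDs in the toy mapping lines are hypothetical and only for illustration.

```python
def split_full_name(taxonomy):
    """Split 'Domain;...;Genus;full_name' into (silva_path, full_name).

    Assumes full_name is the last ';'-separated field and contains no ';'.
    """
    fields = [f for f in taxonomy.strip().split(";") if f]
    silva_path = ";".join(fields[:-1]) + ";"  # mapping-file paths end in ';'
    return silva_path, fields[-1]


def load_rank_map(lines):
    """Build {taxonomy path: rank} from tab-separated mapping lines.

    Assumes column 1 is the path and column 3 is the rank.
    """
    rank_map = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            rank_map[parts[0]] = parts[2]
    return rank_map


# Toy stand-in for tax_slv_ssu_128.txt (IDs are made up for illustration).
toy_map_lines = [
    "Bacteria;\t1\tdomain\t\t128",
    "Bacteria;Cyanobacteria;\t2\tphylum\t\t128",
    "Bacteria;Cyanobacteria;Cyanobacteria;SubsectionI;FamilyI;Microcystis;\t3\tgenus\t\t128",
]
rank_map = load_rank_map(toy_map_lines)

# Taxonomy string as exported with full_name appended (illustrative).
modified = ("Bacteria;Cyanobacteria;Cyanobacteria;SubsectionI;FamilyI;"
            "Microcystis;Microcystis aeruginosa PCC 7806")

# Exact matching on the modified string fails...
print(modified in rank_map, (modified + ";") in rank_map)

# ...but succeeds once full_name is split off and kept aside.
path, full_name = split_full_name(modified)
print(rank_map.get(path), "|", full_name)
```

If this picture is right, the rank mapping could be run on the stripped path and ‘full_name’ re-attached afterwards as an extra, species-like level, but I don’t know whether that is the best way to do it.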
The reason I’m doing all of this is to try to get better identification for OTUs from harmful algal species and cyanobacteria. The cyanobacteria are problematic because some of the OTUs are identified only as ‘Family I’, but when you BLAST the sequences that comprise them, they’re identified to at least genus, if not species.
Can anyone suggest a work-around for including species in the v128 reference files and then getting this taxonomy to work in mothur?