Formatting Silva v128 reference taxonomy

Hi,

I’ve exported a Silva v128 database from ARB that includes the taxonomy field ‘tax_slv’ as well as the ‘full_name’ field for the 190,661 sequences. I did this so that I can (in most cases) get the species or strain name for each sequence, instead of a taxonomy truncated to genus. I recognize that in some cases the ‘full_name’ will be misleading, e.g., it will refer to a eukaryote that was targeted for genome sequencing when the reported sequence was bacterial, but I think I can deal with this.

In the README here: http://blog.mothur.org/2017/03/22/SILVA-v128-reference-files/ there is a step in which you run R code (provided by Eric Collins) to map the taxa to 6 Linnaean levels, a process that starts with reading in the Silva mapping file ‘tax_slv_ssu_128.txt’. This may seem obvious, but when I run the R code on the modified taxonomy file (including ‘full_name’), the process fails and no taxonomy string is returned. The code looks for an exact match between the text in tax_slv_ssu_128.txt and the taxonomy string of each of the 190,661 sequences, and because I appended the ‘full_name’ to the taxonomy, nothing matches. I’ve checked, and the folks at ARB/Silva do not have a version of tax_slv_ssu_128.txt that includes species or strain.
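To make the mismatch concrete, here is a minimal Python sketch (not the README's R code) of why an exact lookup fails once ‘full_name’ is appended, and how splitting off the last semicolon-separated field restores the match. The lookup table, the taxon names, and the helper names split_full_name and lookup_rank are all illustrative assumptions. It also assumes that paths in the mapping file end with a semicolon and that ‘full_name’ itself contains no semicolon.

```python
# Toy stand-in for tax_slv_ssu_128.txt: taxonomy path -> rank.
# Assumption: every path in the mapping file ends with a semicolon.
RANKS = {
    "Bacteria;": "domain",
    "Bacteria;Cyanobacteria;": "phylum",
    "Bacteria;Cyanobacteria;ExampleGenus;": "genus",  # made-up path
}


def lookup_rank(path):
    """Exact-match lookup, mimicking the behaviour described above."""
    return RANKS.get(path)


def split_full_name(tax_with_name):
    """Split 'tax_slv;full_name' into (tax_slv + ';', full_name).

    Assumes full_name is the last semicolon-separated field and
    contains no semicolon itself.
    """
    tax, _, full_name = tax_with_name.rpartition(";")
    return tax + ";", full_name


# Hypothetical taxonomy string as exported with full_name appended.
exported = "Bacteria;Cyanobacteria;ExampleGenus;ExampleGenus sp. strain X1"

# The extended string is not a key in the mapping, so nothing is returned.
rank_before = lookup_rank(exported)

# Matching on the original path works; the name can be re-attached afterwards.
tax_path, full_name = split_full_name(exported)
rank_after = lookup_rank(tax_path)
```

If this assumption about the layout holds, the same idea could be applied before the rank-mapping step, with ‘full_name’ added back as an extra level once the 6 levels have been assigned.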

The reason I’m doing all of this is to try to get better identification for OTUs from harmful algal species and cyanobacteria. The cyanos are problematic because some of the OTUs are simply identified as ‘Family I’, but when you BLAST the sequences that comprise them, they’re identified to at least genus, if not species.

Can anyone suggest a work-around for including species in the v128 reference files and then getting this taxonomy to work in mothur?

Thanks,
Pete

Hi Pete,

My colleague Tim and I are facing a similar issue: we have PacBio SMRTbell data from full-length bacterial and archaeal primer sets and want to add at least a hint of species-level classification. We believe this should be possible with an average read length of 1500 nucleotides of the 16S, covered at least 4 times (but on average a lot more).

I will have a look at the readme and try to adapt it myself. I’ve been through the process in the past and I know it is not always foolproof but I am quite competent in using Arb & R so I think it should be feasible.

Furthermore, I was wondering: would you be willing to share the Arb export filter you used (*.eft) with me?

Kind regards,

FM Kerckhof

Hi FM,

Here’s the *.eft file I’ve been using to extract the full name with the taxonomy…

SUFFIX fasta
BEGIN

(acc).(name)\t*(align_ident_slv)\t*(tax_slv);*(full_name)
*(|export_sequence)
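For anyone reading the exported file downstream, here is a minimal Python sketch of pulling the fields back out of one header line. It assumes the header comes out as accession.name, alignment identity, and taxonomy separated by tabs, with ‘full_name’ as the last semicolon-separated piece of the taxonomy. The function name parse_header and the example header are hypothetical, and a leading ‘>’ is stripped if present.

```python
def parse_header(line):
    """Parse 'acc.name<TAB>align_ident_slv<TAB>tax_slv;full_name'.

    Assumes exactly three tab-separated fields and that full_name
    contains no semicolon. Values are kept as strings.
    """
    seq_id, align_ident, tax_and_name = line.rstrip("\n").lstrip(">").split("\t")
    tax_slv, _, full_name = tax_and_name.rpartition(";")
    return {
        "id": seq_id,
        "align_ident": align_ident,
        "tax_slv": tax_slv,
        "full_name": full_name,
    }


# Hypothetical header line, not taken from the real database.
example = ">AB000000.ExName\t95.2\tBacteria;Cyanobacteria;ExampleGenus;ExampleGenus sp. X1\n"
record = parse_header(example)
```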

Cheers,

Pete