Impact of training sets on classification of high-throughput

Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys
Jeffrey J Werner, Omry Koren, Philip Hugenholtz, Todd Z DeSantis, William A Walters, J Gregory Caporaso, Largus T Angenent, Rob Knight and Ruth E Ley
The ISME Journal advance online publication, 30 June 2011; doi:10.1038/ismej.2011.82

What are your thoughts?

I have been using the RDP6 training set and I’m looking for a particular species in my data set. I’ve designed a FISH probe that, fortunately, targets the region that gets sequenced in my data set. However, I never picked the species up, which made me look deeper into the training sets, and I realise the taxonomy goes down to genus, not species (I realise it’s hard to get species-level resolution with such short reads). There are about 10 sequences of this genus of interest in the RDP6 training set (so they all have exactly the same taxonomy as far as classify.seqs in mothur is concerned). When I seqmatch these in RDP, they each hit the type strain of the species that the sequence comes from… so the sequences (in this case) should be able to give me species-level information, provided the species is in the training set, no? I’ve tried adding an extra line to the taxonomy and fasta files, hoping the sequence would get assigned to it, but it didn’t (there is most likely something I don’t understand about the algorithm).

Shaun

That’s actually a nice paper and it’s well written.

I’ve been wondering about the various training sets. The results can be quite different for the same dataset. Perhaps this is made worse by short sequences. I work with 60-nt Illumina reads from the V6 region. I thought RDP could be my best choice because the proportion of unclassified reads is smaller than with SILVA: 18% vs 36% using a 60% cutoff.
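For anyone wanting to repeat this kind of comparison across training sets, here is a minimal Python sketch that computes the fraction of reads left unclassified at a given rank and cutoff. It assumes a tab-separated taxonomy file with lines of the form `name<TAB>Bacteria(100);Firmicutes(98);...;`, with bootstrap confidences in parentheses. The function names and the rank depth argument are made up for illustration, and this is not mothur's own code.

```python
import re

# One rank entry such as "Firmicutes(98)"; the "(98)" confidence is optional.
RANK_RE = re.compile(r"^(?P<name>.*?)(?:\((?P<conf>\d+)\))?$")


def parse_taxonomy(tax_string):
    """Split 'Bacteria(100);Firmicutes(98);' into [(name, confidence), ...]."""
    ranks = []
    for entry in tax_string.strip().strip(";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        match = RANK_RE.match(entry)
        conf = match.group("conf")
        ranks.append((match.group("name"), int(conf) if conf is not None else None))
    return ranks


def is_classified(ranks, depth, cutoff):
    """True if the rank at 1-based depth exists, is named, and meets the cutoff."""
    if len(ranks) < depth:
        return False
    name, conf = ranks[depth - 1]
    if not name or name.lower().startswith("unclassified"):
        return False
    # Assumption: a rank with no confidence value was already thresholded upstream.
    return conf is None or conf >= cutoff


def fraction_unclassified(lines, depth, cutoff):
    """Fraction of reads that are not classified at the given depth and cutoff."""
    total = 0
    unclassified = 0
    for line in lines:
        if not line.strip():
            continue
        _read_name, tax = line.rstrip("\n").split("\t", 1)
        total += 1
        if not is_classified(parse_taxonomy(tax), depth, cutoff):
            unclassified += 1
    return unclassified / total if total else 0.0
```

Running `fraction_unclassified(open(path), depth, 60)` on the taxonomy output from each training set, with the same depth and cutoff, would give directly comparable numbers (here `path` and `depth` are placeholders you would fill in yourself).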

We tried our best to reconstruct the Werner database, but the numbers aren’t dead on for some reason. Regardless, we’re providing a mothur-formatted version…

http://www.mothur.org/wiki/Greengenes-formatted_databases

Hello

So as I see it, everyone else is also wondering which training set is best, and I guess trying all three is the only real answer. In the same paper, Werner et al. suggest that classifying your query sequences against reference training sets trimmed to the specific primer regions gave better results than using full-length reference sequences. Has anyone else tried this in mothur, and does it indeed give better results? Would it be worth comparing full-length reference sequences against region-specific ones? Is it possible for a version of this training set to be made compatible with mothur? I see that greengenes and SILVA versions of this resource have been made available on http://www.hmpdacc.org/HMMCP/. Could you guys maybe convert the files to be usable in mothur, or is that a bit too much to ask? Also, Werner’s paper suggests that the greengenes database is actually better to use than the other two databases because of its higher diversity, so why does mothur keep suggesting RDP instead?

Thanks in advance for the help :slight_smile:

We’ve done this, much the same way we do it for the alignment-specific database. If you were doing the RDP you would…

  1. Align the RDP fasta file against the SILVA reference
  2. Run screen.seqs on the RDP aligned file and taxonomy file for the region you want

Voilà, a region-specific reference :slight_smile:
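To make the idea behind the two steps concrete, here is a rough Python sketch of the screening logic: given reference sequences that are already aligned (step 1), keep only those that span a chosen column range, and keep the matching taxonomy lines. This is an illustration of the concept, not what mothur runs internally. The function names and the `trim` option are hypothetical, and trimming to the region is an extra that the steps above do not mention. The column coordinates are whatever your region maps to in the alignment.

```python
GAP_CHARS = set(".-")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def alignment_span(aligned_seq):
    """Return 1-based (start, end) columns of the first and last bases, or None."""
    columns = [i for i, ch in enumerate(aligned_seq, start=1) if ch not in GAP_CHARS]
    if not columns:
        return None
    return columns[0], columns[-1]


def covers_region(aligned_seq, region_start, region_end):
    """True if the sequence starts at or before region_start and ends at or after region_end."""
    span = alignment_span(aligned_seq)
    return span is not None and span[0] <= region_start and span[1] >= region_end


def region_specific_reference(fasta_lines, taxonomy_lines, region_start, region_end, trim=False):
    """Return (sequences, taxonomy) dicts restricted to references covering the region."""
    kept = {}
    for name, seq in read_fasta(fasta_lines):
        if covers_region(seq, region_start, region_end):
            # Optional extra (not part of the two steps above): cut to the region columns.
            kept[name] = seq[region_start - 1:region_end] if trim else seq
    taxonomy = {}
    for line in taxonomy_lines:
        if line.strip():
            name, tax = line.rstrip("\n").split("\t", 1)
            if name in kept:
                taxonomy[name] = tax
    return kept, taxonomy
```

The point of filtering the taxonomy alongside the fasta is that the two reference files must list exactly the same sequence names.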

Pat

Hello

I see what you mean :). Thank you. :lol: