RDP, SILVA, GreenGenes

Hello Mothur,

I wanted to know which classification is better (RDP, SILVA, or Greengenes) and why. Can we use the ARB-SILVA file directly for classification, or should only the mothur-formatted files be used?

Thanks

Our lab was wondering the same thing a while ago, so I made a mock community of type strains from SILVA and classified them against each database.

In our experience, the RDP taxonomy is the most accurate, but it doesn’t include candidate phyla - so if you expect your sample to contain members of groups like TM7 or Poribacteria, it won’t be much good. Greengenes is only fractionally less accurate though (checking my notes, the mothur RDP7/pds database had a misclassification rate of 0.05% and Greengenes 0.27%). The SILVA taxonomy was 100% accurate, although that’s probably because we were comparing against their taxonomy to begin with. Also, the percentages of reads that couldn’t be classified were 0.17% (RDP), 1.72% (Greengenes), and 5.76% (SILVA, but our community included some archaea, which I don’t believe the SILVA taxonomy includes).
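For anyone who wants to run the same kind of check on their own mock community, here is a rough sketch of how the two percentages could be computed. It is not the script we used. It assumes you have the known taxonomy and the classifier output as dictionaries keyed by sequence ID, with mothur-style taxonomy strings, and that a rank the classifier could not assign contains the word "unclassified". The function names (`normalize`, `score_mock`) are made up for this example.

```python
import re


def normalize(taxonomy):
    """Split a taxonomy string into rank names, dropping confidence
    scores such as "(100)" and the empty field after the last semicolon."""
    ranks = [re.sub(r"\(\d+\)$", "", rank.strip()) for rank in taxonomy.split(";")]
    return [rank for rank in ranks if rank]


def score_mock(expected, observed, level):
    """Return (misclassified %, unclassified %) at a rank depth (1-based).

    expected / observed: dicts mapping sequence ID -> taxonomy string.
    A read counts as unclassified if it has no assignment at `level`
    (assumption: such ranks are missing or contain "unclassified").
    """
    if not expected:
        raise ValueError("expected taxonomy is empty")
    if level < 1:
        raise ValueError("level must be at least 1")
    wrong = unclassified = 0
    for seq_id, expected_tax in expected.items():
        exp = normalize(expected_tax)[:level]
        obs = normalize(observed.get(seq_id, ""))[:level]
        if len(obs) < level or "unclassified" in obs[-1].lower():
            unclassified += 1
        elif obs != exp:
            wrong += 1
    total = len(expected)
    return 100.0 * wrong / total, 100.0 * unclassified / total
```

Note that comparing names directly only makes sense when the expected and observed strings come from the same taxonomy. Comparing a SILVA-derived expectation against RDP or Greengenes names needs a manual mapping first, since the same group can be named differently.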

So my answer (and I’m sure others will have different advice) would be to stick to RDP or Greengenes.

And for your second question: you could make your own classification database from ARB. You need to export the unaligned sequences as a fasta file, then export the taxonomy information in the format that mothur expects. I think I’ve posted an export mask for that somewhere on these forums before, but it’s not too hard to make. If you’re making your own database though, be careful that it’s actually accurate - run mock data through it first before putting your unknown reads through it.
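If you don't have an export mask handy, the reformatting can also be done after the fact. The sketch below assumes a hypothetical ARB export with one line per sequence, holding the sequence ID and a semicolon-separated taxonomy path separated by a tab; your export may well look different. It writes the layout mothur's taxonomy files use, as far as I know: the ID, a tab, then the ranks separated by semicolons with a trailing semicolon. Replacing spaces with underscores is my own precaution, not something I've checked mothur requires. All function names here are invented for the example.

```python
def to_mothur_taxonomy_line(seq_id, tax_path):
    """Format one record as: <id><TAB>rank1;rank2;...;"""
    ranks = [rank.strip().replace(" ", "_") for rank in tax_path.split(";")]
    ranks = [rank for rank in ranks if rank]  # drop empty fields
    return f"{seq_id}\t{';'.join(ranks)};"


def degap(aligned_sequence):
    """Remove alignment gap characters to get the unaligned sequence."""
    return aligned_sequence.replace("-", "").replace(".", "")


def convert(export_lines):
    """Convert tab-delimited 'id<TAB>taxonomy path' lines (assumed layout)."""
    converted = []
    for line in export_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        seq_id, tax_path = line.split("\t", 1)
        converted.append(to_mothur_taxonomy_line(seq_id.strip(), tax_path))
    return converted
```

The IDs in the taxonomy file have to match the fasta headers exactly, so it is worth checking that both files contain the same set of IDs before classifying anything.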

You might also want to check out http://www.ncbi.nlm.nih.gov/pubmed/21716311. I think that for non-host-associated communities, Greengenes is likely best. When we run mouse/human sequence sets through, we get very similar classifications across the databases, or at least they’re equally poor. We have tried to replicate the gg database used in the above link, and it is posted at http://www.mothur.org/wiki/Greengenes-formatted_databases.

I have a more technical question. Do those curated databases carry only one rRNA gene sequence per species? Or if a bacterium has 12 copies of the rRNA gene with high intragenomic variation, will all 12 copies of the 16S rDNA be found in those databases? Do those 16S rRNA genes have to be experimentally confirmed to be active before they get into those databases?

The reason I am asking: because we amplify the 16S rRNA gene before sequencing, I would imagine that all 12 copies get amplified by universal primers. Some copies might not be functional 16S rRNA genes and might not be present in the taxonomy databases, but those copies will still represent the bacterium we extracted the gDNA from.

Sorry. I could probably find this information by digging through the literature, but I figured it would be easier to ask people who work with those databases and know the difference.

Regards,
–Yury

Some taxa have one sequence, some have many.
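If you want to see this for a particular reference, one option is to count how many sequences share each full taxonomy string. This is only a sketch: it assumes a taxonomy file with one `id<TAB>taxonomy` record per line, and the function name is made up.

```python
from collections import Counter


def sequences_per_taxon(taxonomy_lines):
    """Count reference sequences per full taxonomy string."""
    counts = Counter()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _seq_id, taxonomy = line.split("\t", 1)
        counts[taxonomy.strip()] += 1
    return counts
```

Calling `sequences_per_taxon(open("my.taxonomy"))` (file name hypothetical) and looking at `.most_common()` shows which taxa are represented by many sequences and which by only one. It won't tell you whether those sequences are different operon copies from one genome or different strains.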

Hi, I would like to know if there is a ready-made database that combines Greengenes and pds (RDP with the extra info). That would be very nice, because I’ve had sequences that were classified with RDP but not with Greengenes, and vice versa. If somebody has one, it would be great if you could pass it on to me. Cheers