RDP, SILVA, GreenGenes

Rich · June 10, 2013, 10:18am

Hello Mothur,

I wanted to know which classification is better, RDP, SILVA, or GreenGenes, and why. Can we use the ARB-SILVA file directly for classification, or should only the mothur-modified files be used?

Thanks

dwaite · June 10, 2013, 9:21pm

Our lab was wondering the same thing a while ago, so I made a mock community of type strains from SILVA and put them through each database.

In our experience, the RDP taxonomy is the most accurate, but it doesn’t include candidate phyla - so if you expect your sample to contain members of groups like TM7, Poribacteria or whatever, it won’t be much good. Greengenes is only fractionally less accurate though (checking my notes, the mothur RDP7/pds database had a misclassification rate of 0.05% and Greengenes 0.27%). The SILVA taxonomy was 100% accurate, although that’s probably because we were comparing against their taxonomy to begin with. Also, the percentages of reads that couldn’t be classified were 0.17% (RDP), 1.72% (Greengenes), and 5.76% (SILVA, but our community included some archaea, which I don’t believe the SILVA taxonomy includes).
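For anyone wanting to repeat this kind of check, here is a rough sketch of how the two rates above could be computed from a mock community. It is not the script used for the numbers quoted; it assumes both the known lineages and the classifier output are in mothur's tab-separated taxonomy layout (`seqID<TAB>rank1;rank2;...;`), that the names in the two taxonomies are directly comparable, and that unresolved ranks are labelled `unknown` or end in `unclassified`. The function names `parse_taxonomy` and `score` are made up for this example.

```python
def parse_taxonomy(lines):
    """Parse mothur-style taxonomy lines into {seq_id: [rank names]}."""
    taxa = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq_id, lineage = line.split("\t", 1)
        # Drop bootstrap confidence scores such as "Firmicutes(98)".
        ranks = [r.split("(")[0] for r in lineage.strip(";").split(";") if r]
        taxa[seq_id] = ranks
    return taxa


def score(expected, observed, level):
    """Return (% misclassified, % unclassified) at a 0-based rank index."""
    wrong = 0
    unclassified = 0
    for seq_id, exp in expected.items():
        obs = observed.get(seq_id, [])
        # A missing rank is treated the same as an unclassified one (assumption).
        name = obs[level] if level < len(obs) else "unclassified"
        if name == "unknown" or name.lower().endswith("unclassified"):
            unclassified += 1
        elif level >= len(exp) or name != exp[level]:
            wrong += 1
    total = len(expected)
    return 100.0 * wrong / total, 100.0 * unclassified / total


if __name__ == "__main__":
    # Toy data only; real input would come from the mock community files.
    expected = parse_taxonomy([
        "s1\tBacteria;Firmicutes;",
        "s2\tBacteria;Proteobacteria;",
    ])
    observed = parse_taxonomy([
        "s1\tBacteria(100);Firmicutes(98);",
        "s2\tBacteria(100);Bacteroidetes(85);",
    ])
    print(score(expected, observed, level=1))
```

Counting unclassified reads separately from wrong ones matters here, because a database that refuses to call a read is making a different kind of error than one that calls it incorrectly.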

So my answer (and I’m sure others will have different advice) would be to stick to RDP or Greengenes.

And for your second question, you could make your own classification database from ARB. You just need to export the unaligned sequences as a fasta file, then export the taxonomy information in the format that mothur expects. I think I’ve posted an export mask for that somewhere on these forums before, but it’s not too hard to make. If you’re making your own database though, be careful that it’s actually accurate - run mock data through it first before putting your unknown reads through it.
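As an illustration of the target format, here is a minimal sketch of splitting an exported FASTA file into the two files mothur wants. It is not the ARB export mask mentioned above. It assumes, hypothetically, that each header looks like `>ID lineage;separated;by;semicolons`; a real export may be laid out differently. The mothur taxonomy file itself is one line per sequence: the ID, a tab, then the lineage with a trailing semicolon. The function name `split_export` is invented for this example.

```python
def split_export(fasta_lines):
    """Return (fasta_text, taxonomy_text) from FASTA lines with lineage headers."""
    fasta_out = []
    tax_out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # Assumed header layout: ">ID lineage;with;semicolons".
            seq_id, _, lineage = line[1:].partition(" ")
            # Replace spaces in taxon names with underscores.
            ranks = [r.strip().replace(" ", "_")
                     for r in lineage.split(";") if r.strip()]
            fasta_out.append(">" + seq_id)
            tax_out.append(seq_id + "\t" + ";".join(ranks) + ";")
        else:
            # Remove alignment gap characters so the sequences are unaligned.
            seq = line.replace("-", "").replace(".", "")
            if seq:
                fasta_out.append(seq)
    return "\n".join(fasta_out) + "\n", "\n".join(tax_out) + "\n"


if __name__ == "__main__":
    # Toy records only.
    fasta_text, tax_text = split_export([
        ">AB1 Bacteria;Firmicutes;Bacillus subtilis",
        "ACG-T..A",
    ])
    print(fasta_text, tax_text, sep="")
```

The two outputs would then be saved and given to classify.seqs as the reference and taxonomy files, after checking them against mock data as described above.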

pschloss · June 12, 2013, 12:02pm

You might also want to check out http://www.ncbi.nlm.nih.gov/pubmed/21716311. I think that for non-host-associated communities Greengenes is likely best. When we run mouse/human sequence sets through, we get very similar classifications, or at least they’re equally poor. We have tried to replicate the gg database used in the above link, and it is posted at http://www.mothur.org/wiki/Greengenes-formatted_databases.

Yury_Ivanov · June 14, 2013, 2:23pm

I have a more technical question. Do those curated databases carry only 1 rRNA gene sequence per species? Or if a bacterium has 12 copies of the rRNA gene with high intra-genomic variation, will all 12 copies of the 16S rDNA be found in those databases? Do those 16S rRNA genes have to be experimentally confirmed to be active before they get into those databases?

The reason I am asking is that, because we are amplifying the 16S rRNA gene before sequencing, I would imagine that we get all 12 copies amplified by universal primers. Some copies might not belong to a functional 16S rRNA and might not be present in taxonomy databases, but those copies will still represent the bacterium we extracted this gDNA from.

Sorry. I could probably find this information by digging through the literature, but I figured it would be easier to ask people who work with those databases and know the difference.

Regards,
–Yury

pschloss · June 14, 2013, 4:00pm

Some taxa have one sequence, some have many.

sasha8roth · October 1, 2013, 10:06am

Hi, I would like to know if there is a database already made in which GreenGenes and pds (RDP with the extra info) are combined. That would be very nice, because I’ve had sequences that were classified with RDP but not GreenGenes, and vice versa. If somebody has one, it would be nice if you could pass it on to me. Cheers
