High % unknowns after classify.otu

Hello,

I’ve classified 16S sequences following the MiSeq SOP, using both the phylotype method and the OTU-based method, but my output files show 26% unknowns at the family level for both approaches. Since these are environmental samples I was expecting some unknowns, but I assumed there would be fewer unknowns with an OTU-based method than with a phylotype method. Is this a valid assumption, or would you expect similar unknown rates?

Additionally, is there any general way to decrease the number of unknowns that I’m seeing? Related to that, how valid are the RDP reference files? (The more people use a database, the more poor-quality entries get incorporated, which can lead to poor matching.)

What environment? How long are the reads? What region? Different databases vary in their ability to classify sequences. Read length is also important as is the region within the 16S rRNA gene.

How valid are the RDP files? Well… I’m not sure what you mean by that. They’re the same files that the RDP uses. The RDP database is pretty well vetted and is manually curated. Other databases you might try include:

greengenes: http://www.mothur.org/wiki/Greengenes-formatted_databases

silva: http://www.mothur.org/wiki/Silva_reference_files

Thanks for your reply, I should have been clearer on everything.

My environment is pitcher plant fluid. The reads are 250 bp and cover the V3-V4 region.

What I’m asking about the RDP files is not so much how good the database is on its own, but rather whether you think I’d have better luck reducing unknowns if I used another database. More specifically, how does RDP compare to greengenes or SILVA in terms of quality? I got the impression that you aren’t fond of the greengenes database because of the poor alignment in the variable regions, but how do the other databases compare to each other?

Thanks for your help thus far!

Hmmm… While I don’t like greengenes for its alignments, I do think it’s pretty good for classification. The classifications are alignment-independent and I’m sure it does a better job in the candidate phyla than RDP. We’ve had variable results with each of the three databases across different sample types. I’d suggest trying the new SILVA and greengenes databases we’ve posted to see how they fare.
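To compare how the databases fare, one option is to tally the share of sequences left unknown at a given rank in each consensus taxonomy file. The Python below is only a rough sketch, not part of mothur or the SOP. It assumes a tab-separated file with OTU, Size and Taxonomy columns, six ranks with family in the fifth position, and unknowns written as "unclassified", "unknown" or "<taxon>_unclassified"; check those assumptions against your own output. The function names and the example rows are made up for illustration.

```python
import re


def rank_labels(taxonomy):
    """Split a taxonomy string into rank labels, dropping confidence scores and quotes."""
    labels = []
    for field in taxonomy.strip().rstrip(";").split(";"):
        # Remove a trailing confidence score such as "(100)".
        label = re.sub(r"\(\d+(\.\d+)?\)$", "", field.strip())
        labels.append(label.strip('"'))
    return labels


def is_unknown(label):
    """Assumed spellings of an unknown label; adjust to match your files."""
    lowered = label.lower()
    return lowered in ("unclassified", "unknown") or lowered.endswith("_unclassified")


def unknown_fraction(lines, rank_index=4):
    """Size-weighted fraction of sequences that are unknown at rank_index.

    rank_index=4 is family if the ranks run kingdom, phylum, class,
    order, family, genus (an assumption about the reference taxonomy).
    """
    total = 0
    unknown = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3 or parts[0].upper() == "OTU":
            continue  # skip the header row and malformed lines
        size = int(parts[1])
        labels = rank_labels(parts[2])
        # A taxonomy that stops short of the requested rank counts as unknown.
        label = labels[rank_index] if rank_index < len(labels) else "unclassified"
        total += size
        if is_unknown(label):
            unknown += size
    return unknown / total if total else 0.0


# Made-up rows for illustration only; not real results.
example = [
    "OTU\tSize\tTaxonomy",
    'Otu001\t60\tBacteria(100);"Proteobacteria"(100);Gammaproteobacteria(100);'
    '"Enterobacteriales"(100);Enterobacteriaceae(100);Serratia(95);',
    "Otu002\t26\tBacteria(100);unclassified(100);unclassified(100);"
    "unclassified(100);unclassified(100);unclassified(100);",
    'Otu003\t14\tBacteria(100);"Bacteroidetes"(100);Flavobacteria(100);'
    '"Flavobacteriales"(100);Flavobacteriaceae(98);Flavobacteriaceae_unclassified(98);',
]

print(f"family-level unknown fraction: {unknown_fraction(example, 4):.2f}")
print(f"genus-level unknown fraction: {unknown_fraction(example, 5):.2f}")
```

To use it on real output, open each database's consensus taxonomy file and pass the file object to unknown_fraction; the fractions for the same rank can then be compared side by side.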

Pat