Using a newer Greengenes training set

I’m analyzing some short 16S reads from mouse gut samples and am trying to see if I can improve the depth of my classifications. Currently I’m using the 84,414-sequence Greengenes training set provided on the wiki, but this seems to be from 2011. I’m wondering if using the 2013 GG release might give better results.

I took a look at the Werner ISMEJ paper linked on the wiki page, and it looks like they extracted representative sequences from GG after clustering at 99% similarity with UCLUST. But at 127,741 sequences, their training set was considerably larger than the one mothur uses, so it seems the mothur version was prepared using a different method (maybe with the clustering and representative-picking tools in mothur). Could anybody fill me in on how this was done so I can replicate it with the newer GG release?

Thanks for the help.

I downloaded gg_otus_4feb2011 from Greengenes. Inside it there’s a folder called rep_set containing a file called “gg_99_otus_4feb2011.fasta”. It turns out this file has only 84,413 sequences; I think I added an E. coli sequence as the first one, which would account for the 84,414 in the mothur training set. The original tarball is available here:

http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Caporaso_Reference_OTUs/gg_otus_4feb2011.tgz

Not sure what happened between what was in the paper and this version.
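
If anyone wants to check the count themselves, here is a minimal Python sketch that counts the records in a FASTA file by counting header lines. It is not how the mothur files were generated; the function names are made up for illustration, and the path to the FASTA file is whatever you get after extracting the tarball, passed on the command line.

```python
import sys


def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


def first_header(path):
    """Return the first header line (without the leading '>'), or None if there is none."""
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                return line[1:].strip()
    return None


if __name__ == "__main__":
    # The FASTA path is supplied by the user; nothing is assumed about where
    # the tarball was extracted.
    if len(sys.argv) > 1:
        fasta_path = sys.argv[1]
        print("sequences:", count_fasta_records(fasta_path))
        print("first header:", first_header(fasta_path))
    else:
        print("usage: python count_fasta.py <file.fasta>")
```

Saved as count_fasta.py (a hypothetical name), it could be run against the rep_set file, e.g. `python count_fasta.py gg_99_otus_4feb2011.fasta`. Printing the first header is just a quick way to see whether an extra sequence was prepended to a training set.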


I just updated the Greengenes reference files, and you can get them here:

http://www.mothur.org/wiki/Greengenes-formatted_databases

The tarball you download includes a readme describing how I generated the various files. Enjoy.

Thanks for the timely help! Hopefully this broader training set will give better classification depth for some of the less well-represented taxa, as the Werner paper suggests. I’ll go see if that’s the case.