Using a newer Greengenes training set

joewan · September 12, 2013, 3:29pm

I’m analyzing some short 16S reads from mouse gut samples and am trying to see if I can improve the depth of my classifications. Currently I’m using the 84,414-sequence Greengenes training set provided on the wiki, but this seems to be from 2011. I’m wondering if using the 2013 GG release might give better results.

I took a look at the Werner ISMEJ paper linked on the wiki page, and it looks like they extracted representative sequences from GG after clustering at 99% similarity with UCLUST. But at 127,741 sequences, their training set was somewhat larger than the one mothur uses, so it seems the mothur version was prepared using a different method (maybe with the clustering and representative-picking tools in mothur). Could anybody fill me in on how this was done so I can replicate it with the newer GG release?

Thanks for the help.

pschloss · September 12, 2013, 8:37pm

I downloaded the gg_otus_4feb2011 tarball from Greengenes. If you look within that, there’s a folder called rep_set, and in there is a file called “gg_99_otus_4feb2011.fasta”. Turns out this only had 84,413 sequences (I think I added an E. coli sequence as the first one, which would account for the 84,414 in the mothur version). The original tarball is available here:

http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Caporaso_Reference_OTUs/gg_otus_4feb2011.tgz

Not sure what happened between what was in the paper and this version.
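
For anyone who wants to check the sequence count of a reference FASTA file themselves, counting the header lines is enough. Here is a minimal sketch in Python; the function name `count_fasta_records` and the example path are illustrative assumptions, not part of mothur or Greengenes:

```python
import gzip


def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'.

    Handles plain-text files and gzip-compressed files (by '.gz' suffix).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # Hypothetical path: assumes the tarball was extracted into the
    # current directory and unpacks to a gg_otus_4feb2011/ folder.
    fasta = "gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta"
    try:
        print(count_fasta_records(fasta))
    except FileNotFoundError:
        print(f"File not found: {fasta}")
```

Running the same function on the training set from the mothur wiki and on the file from the tarball should show whether they differ by the single extra sequence.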

I just updated the Greengenes reference files, and you can get them here:

http://www.mothur.org/wiki/Greengenes-formatted_databases

I’ve included a readme in the tarball that explains how I generated the various files. Enjoy.

joewan · September 12, 2013, 9:16pm

Thanks for the timely help! Hopefully this broader training set will give better classification depth for some of the less well-represented taxa, as the Werner paper suggests. I’ll go see if that’s the case.
