I’m analyzing some short 16S reads from mouse gut samples and am trying to see if I can improve the depth of my classifications. Currently I’m using the 84,414-sequence Greengenes training set provided on the wiki, but this seems to be from 2011. I’m wondering if using the 2013 GG release might give better results.
I took a look at the Werner ISME J paper linked on the wiki page, and it looks like they extracted representative sequences from GG after clustering at 99% similarity with UCLUST. But at 127,741 sequences, their training set is considerably larger than the 84,414-sequence one mothur uses, so it seems the mothur version was prepared using a different method (maybe with the clustering and representative-picking tools in mothur). Could anybody fill me in on how this was done so I can replicate it with the newer GG release?
I downloaded the gg_otus_4feb2011 from greengenes. If you look within that there’s a folder called rep_set, and in there is a file called “gg_99_otus_4feb2011.fasta”. Turns out this only had 84,413 sequences; I think the one extra in the 84,414-sequence training set is an E. coli sequence I added as the first one. The original tarball is available here:
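If you want to check the sequence count on your own copy of the rep_set file (or on the training set from the wiki), a minimal Python sketch like the one below should do it. It only counts header lines starting with ">" and reports the first header, so you can see whether an extra sequence sits at the top. The function name `summarize_fasta` is just made up for this example, and it assumes a plain, uncompressed FASTA file.

```python
import sys
from typing import Iterable, Optional, Tuple


def summarize_fasta(lines: Iterable[str]) -> Tuple[int, Optional[str]]:
    """Return (number of records, first header without the leading '>')."""
    count = 0
    first_header = None
    for line in lines:
        # Every FASTA record begins with a header line starting with '>'.
        if line.startswith(">"):
            count += 1
            if first_header is None:
                first_header = line[1:].strip()
    return count, first_header


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the FASTA file as the first command-line argument.
    with open(sys.argv[1]) as handle:
        n_records, first = summarize_fasta(handle)
    print(f"{n_records} sequences; first record: {first}")
```

Running it on both files and comparing the counts and first headers would show whether they differ by just the one added record.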
Thanks for the timely help! Hopefully this broader training set will give better classification depth for some of the less well-represented taxa, as the Werner paper suggests. I’ll go see if that’s the case.