So, I’m dealing with data from “The Earth Microbiome Project”… Illumina HiSeq @ 100bp… I decided to run the data through Mothur rather than use the QIIME-analysed data they provide. They say:
_97% OTUs are picked in the initial EMP analyses using a closed-reference OTU picking protocol against the Greengenes database pre-clustered at 97% identity
This is done using pick_reference_otus_through_otu_table.py.
This process works as follows. Reads are pre-sorted by abundance in QIIME so the most frequently occurring sequences will be chosen as OTU centroid sequences. Each read is then searched against the Greengenes reference sequences using reference-based uclust version 1.2.22. The call to uclust issued by QIIME looks like:
uclust --id 0.97 --w 12 --stepwords 20 --usersort --maxaccepts 20 --libonly --stable_sort --maxrejects 500
Reads which hit a sequence in the reference collection at greater than or equal to 97% identity are assigned to an OTU defined by the reference sequence they match. Reads which fail to hit a reference sequence at at least 97% identity are discarded_
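To make sure I understand the protocol they describe, here is a toy sketch of closed-reference OTU picking in Python. The `percent_identity` function is a crude stand-in I made up for illustration; real uclust uses word-index heuristics and alignment, not position-by-position comparison, and the abundance pre-sort here only mimics the effect of QIIME's `--usersort` behaviour:

```python
from collections import Counter

def percent_identity(read, ref):
    """Crude identity: matching positions over the shorter length.
    (Hypothetical stand-in -- uclust does real alignment.)"""
    n = min(len(read), len(ref))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(read, ref) if a == b)
    return matches / n

def pick_closed_reference_otus(reads, references, threshold=0.97):
    """Assign each read to the first reference it hits at >= threshold
    identity; reads that hit nothing are discarded, as in the EMP protocol."""
    # Process unique reads from most to least abundant, so frequent
    # sequences are handled first (analogous to QIIME's abundance pre-sort).
    counts = Counter(reads)
    otu_table = {}   # reference ID -> read count assigned to that OTU
    discarded = []   # reads with no reference hit at >= threshold
    for read, count in counts.most_common():
        for ref_id, ref_seq in references.items():
            if percent_identity(read, ref_seq) >= threshold:
                otu_table[ref_id] = otu_table.get(ref_id, 0) + count
                break
        else:
            discarded.append(read)
    return otu_table, discarded
```

The key property, if I read their description right, is the last branch: any read without a ≥97% reference hit is silently dropped, which is presumably why the QIIME output contains no unclassified sequences.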
With that, I end up with about 200,000 sequences per sample, with presumably good taxonomy (unclassified reads have intrinsically been removed).
When I run the EMP quality-filtered sequences through Mothur, I end up with 55,000 sequences (there is stricter quality control in the Mothur run; the initial quality-filtered sequences had base calls as low as Q18) and hopeless taxonomy - there is a tonne of unclassified sequences.
So how can I end up with poorer taxonomy from higher-quality sequences? Any ideas? How do the algorithms in QIIME work? Their website is confusing…