I’m processing 454 16S bacterial data following the Costello example. Currently I am only working with the data from a single MID while developing my pipeline, so by the time I reach the cluster.seqs step I have 1532 sequences total (~371 unique). Before all of the filtering and trimming from the Costello pipeline I had >2000 sequences.
After running through the classify commands to assign OTUs, I have one large OTU (>1400 sequences) and 22 or 11 others (at the 0.03 and 0.10 OTU definition levels, respectively). The one large OTU is not unexpected; I already knew this sample was dominated by a single species. What concerns me is that the remaining OTUs don’t appear well defined to me. My understanding is that at the 0.03 level, sequences that differ by no more than 3% (i.e., are at least 97% identical) should be assigned to the same OTU. However, I have several OTUs containing only one sequence, and when I align these sequences against each other on the BLAST website they show <1% difference. Why aren’t they in a single OTU? Does this just reflect a difference between the alignment algorithms? The vast majority of the sequences in these small OTUs BLAST as the same genus, but simply raising my cutoff doesn’t seem to fix it (or rather, raising the cutoff to 0.20 seems a little ridiculous when there doesn’t appear to be that much difference between the sequences).
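To make concrete what I mean by the 0.03 definition, here is a minimal Python sketch of the pairwise comparison I have in mind. It is my own illustration, not mothur's actual distance calculation: the function names, the gap handling (columns where both sequences have a gap are skipped, any other difference counts as one mismatch), and the example sequences are all assumptions.

```python
def pairwise_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing columns between two already-aligned sequences.

    Assumption: columns where both sequences have a gap ('-' or '.') are
    ignored; a gap opposite a base counts as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    gap_chars = {"-", "."}
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars and b in gap_chars:
            continue  # shared gap column carries no information
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable columns")
    return mismatches / compared


def within_cutoff(seq_a: str, seq_b: str, cutoff: float = 0.03) -> bool:
    """True if the pair's distance is at or below the OTU cutoff."""
    return pairwise_distance(seq_a, seq_b) <= cutoff


if __name__ == "__main__":
    # Hypothetical aligned sequences, 100 columns, differing at one position.
    ref = "ACGT" * 25
    near = ref[:-1] + "A"
    print(pairwise_distance(ref, near))  # one mismatch out of 100 columns
    print(within_cutoff(ref, near))
```

By this kind of pairwise measure, the singleton sequences I am describing fall well inside the 0.03 cutoff relative to each other, which is why their separate OTUs surprise me.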
Thanks in advance!