cluster.seqs and OTU assignment

I’m processing 454 16S bacterial data following the Costello example. Currently I am only working with the data from a single MID while developing my pipeline, so by the time I reach the cluster.seqs step I have 1532 sequences total (~371 unique). Before all of the filtering and trimming from the Costello pipeline I had >2000 sequences.

After running through the classify commands to assign OTUs, I have one large OTU (>1400 sequences) and 22-11 others (at the 0.03 and 0.10 OTU definition levels). The one large OTU is not unexpected; I already knew this sample was dominated by a single species. What concerns me is that the remaining OTUs don’t appear well defined to me. My understanding is that at the 0.03 level, sequences that differ by no more than 3% (i.e., are at least 97% identical) should be assigned to the same OTU. However, I have several OTUs containing only one sequence, and when I align these sequences on the BLAST website they show <1% difference. Why aren’t they in a single OTU? Is this just reflecting a difference between the alignment algorithms? The vast majority of the sequences in these small OTUs BLAST to the same genus, but simply raising my cutoff definition doesn’t seem to fix it (or rather, raising the cutoff to 0.20 seems a little ridiculous when there doesn’t appear to be that much difference between the sequences).

Thanks in advance!

Hmm… That sounds weird. So a couple things…

  1. BLAST is a local alignment algorithm, which means that it only reports a % similarity for the most conserved portion of the sequence. So it could align 100 of 200 bases and they could be 100% identical over those 100 bases, but very different over the other 100 bases. In contrast, mothur uses a global approach and calculates the distance over the full length of the gene.

  2. I’m not sure what #2 could be - can you post two of the singleton OTUs that you think should be in the same OTU and we can take a look? Feel free to post the two sequences here or to email them to
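
To make the local vs. global point in #1 concrete, here is a toy Python sketch. It is not how BLAST or mothur actually compute anything; the two functions and the made-up sequences are purely illustrative, and the sequences are assumed to be already aligned and of equal length. It just shows how two 200-base reads can look 100% identical over their best 100-base stretch and still be 50% different over their full length.

```python
def global_distance(a, b):
    """Fraction of mismatched columns over the full length of two
    pre-aligned, equal-length sequences (toy stand-in for a global distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def best_local_identity(a, b, window):
    """Highest fraction of identical columns found in any stretch of
    `window` columns (toy stand-in for a local alignment's reported identity)."""
    best = 0.0
    for start in range(len(a) - window + 1):
        sub_a = a[start:start + window]
        sub_b = b[start:start + window]
        matches = sum(1 for x, y in zip(sub_a, sub_b) if x == y)
        best = max(best, matches / window)
    return best


# Hypothetical sequences: the first 100 bases are shared, the last 100 differ.
shared = "ACGT" * 25
seq_a = shared + "A" * 100
seq_b = shared + "C" * 100

print(best_local_identity(seq_a, seq_b, 100))  # 1.0 -> looks identical locally
print(global_distance(seq_a, seq_b))           # 0.5 -> far above a 0.03 cutoff
```

With numbers like these, the pair would never land in the same OTU at 0.03 even though a local alignment reports a perfect match over the conserved half.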

Emailed in the sequences and my batch file.