Representative OTU seqs closer than OTU cut-off?

Hello all,
Has anyone run into this “problem” while looking at the get.oturep output? I ran my eukaryote 18S 454 sequences through mothur using the default distance and clustering methods (furthest neighbor) at a cut-off of 0.02. After running the get.oturep command (again at the 0.02 level), I have a bunch of dominant OTUs whose representative sequences are nearly identical. Here are two examples copied from the output file (one contains a gap because the output comes from the aligned input files for the dist command):

GNV8VFP05FZRD8|2601|142
CCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGTTGAGAACGGCCGGTCCGCCGTTTGGTGTGCACTGGCTGGTCTCAACTTCCTGTAGAGGACGCGCTCTGGGTTAACGCTCGGACGCGGAGTCTACGTGGTTACTTTGAAAAAATTAGAGTGTTCAAAGCGGGCTTACGCTTGAATATTTCAGCATGGAATAACACTATAGGACTCCTGTCCTATTTCGTTGGTCTCGGGACGGGAGTAATGATTAAGAGGAACAGTTGGGGCATTCGTATTTCATTGTCAGAGGTGAAATTCTTGGGATTTATGAAAGACGAACTTCTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGCTCGAAGATGATTAGATACCAT
GNV8VFP05FU55X|2597|165
CTGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGTTGAGAACGGCCGGTCCGCCGTTTGGTGTGCACTGGCTGGTCTCAACTTCCTGTAGAGGACGCGCTCTGGGTTAACGCTCGGACGCGGAGTCTACGTGGTTACTTTGAAAAAATTAGAGTGTTCAAAGCGGGCTTACGCTTGAATATTTCAGCATGGAATAACACTATAGGACTCCTGTCCTATTTCGTTGGTCTCGGGACGGGAGTAATGATTAAGAGGAACAGTTGGGGCATTCGTATTTCATTGTCAGAGGTGAAATTCTT-GGATTTATGAAAGACGAACTTCTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGCTCGAAGACGATCAGATACCGT

If you align these two sequences (to re-create what was in the output file), you’ll see that they have 4 single-base-pair differences and one gap (which counts as a single difference with the default options), so 5 differences over 432 bp of sequence. That is a distance of 5/432 = 1.16%, well below the 2% cut-off set in the cluster step to make the OTUs.
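
For anyone who wants to check this kind of arithmetic, here is a minimal Python sketch. It assumes a simplified "one gap" rule (each run of consecutive gaps counts as a single difference and a single compared position) and is not mothur's own code; the function name onegap_distance and the short toy sequences are made up for illustration.

```python
def onegap_distance(a: str, b: str) -> float:
    """Distance between two aligned sequences of equal length.

    Simplified rule (an assumption, not mothur's implementation):
    a mismatch counts as one difference, and each run of consecutive
    gap columns counts as one difference and one compared position.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = 0
    compared = 0
    in_gap_run = False
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap in both sequences: nothing to compare
        if x == "-" or y == "-":
            if not in_gap_run:
                diffs += 1      # count the whole gap run once
                compared += 1
            in_gap_run = True
        else:
            in_gap_run = False
            compared += 1
            if x != y:
                diffs += 1
    return diffs / compared if compared else 0.0


# Toy example: one gap plus one mismatch over 9 compared positions.
toy = onegap_distance("ACGT-ACGT", "ACGTTACGA")
print(f"toy distance = {toy:.4f}")

# The numbers from the post: 5 differences over 432 positions.
print(f"5/432 = {100 * 5 / 432:.2f}%")
```

Running the function on the two aligned representative sequences themselves would be the direct way to confirm the count of 5 differences.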

So why do these two OTUs exist separately instead of being collapsed into one OTU?

I suspect it comes from the OTU clustering: it takes a sequence and grabs everything <2% distant from it to make a node (OTU#1), then moves to the next available sequence to make the next node (OTU#2). In the end, you may have some sequences at the fringes of the two OTUs that are closer to each other (<2%) than the “middle” of the nodes are to one another (>2%), if you see what I mean…

Can someone back me up on this or confirm it? Even if this is roughly the explanation, it makes me a bit uncomfortable… hopefully I just have a bug…

That would be my guess. We haven’t changed the default yet, but we far prefer using average neighbor at this point because it gets around these types of problems.
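
To make the fringe effect concrete, here is a toy Python sketch of furthest neighbor (complete linkage) clustering. It is a naive illustration under stated assumptions, not mothur's implementation: the function name furthest_neighbor, the three sequence names, and the distances are all hypothetical. Under furthest neighbor, two clusters merge only if every pair across them is within the cut-off, so a sequence can be left out of an OTU even though it is within 2% of one of its members.

```python
from itertools import combinations


def furthest_neighbor(dist, names, cutoff):
    """Naive complete-linkage clustering.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    Repeatedly merge the two clusters whose most distant pair is
    smallest, and stop once that distance exceeds the cutoff.
    """
    clusters = [frozenset([n]) for n in names]

    def linkage(c1, c2):
        # Furthest neighbor: distance between the two most distant members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(c1, c2) > cutoff:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical distances: B sits between A and C.
dist = {
    frozenset(("A", "B")): 0.010,
    frozenset(("B", "C")): 0.015,
    frozenset(("A", "C")): 0.025,
}

otus = furthest_neighbor(dist, ["A", "B", "C"], cutoff=0.02)
print(otus)
```

In this toy case A and B merge first, and C cannot join because A and C are 2.5% apart. C ends up in its own OTU even though it is only 1.5% from B, which is the same pattern as two representative sequences sitting closer together than the cut-off.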

OK Pat,
I read through your Talmudic question post and the Huse paper discussing Average Neighbor (AN). Since this question is still bugging me, I did a little test with thousands of 454 seqs from Bacteria, Archaea and Eukarya that I had previously analyzed with the default Furthest Neighbor (FN), and reran them with AN:

Furthest Neighbor / Average Neighbor OTUs (change expressed relative to the AN count):
Bacteria = 907 / 817 (-11%)
Archaea = 901 / 641 (-41%)
Eukarya = 4051 / 4083 (+1%)

Furthest Neighbor / Average Neighbor Singletons:
Bacteria = 490 / 567
Archaea = 205 / 330
Eukarya = 2462 / 3084

Furthest Neighbor / Average Neighbor % Singletons:
Bacteria = 54% / 69%
Archaea = 23% / 51%
Eukarya = 61% / 76%
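
The percentages above can be re-derived from the raw counts with a few lines of Python. This is only a convenience sketch; the names COUNTS and summarize are made up here. The change in OTU numbers depends on which count is used as the baseline, so the sketch reports it both relative to the FN count and relative to the AN count.

```python
# (total OTUs, singleton OTUs) per clustering method, copied from the lists above.
COUNTS = {
    "Bacteria": {"FN": (907, 490), "AN": (817, 567)},
    "Archaea": {"FN": (901, 205), "AN": (641, 330)},
    "Eukarya": {"FN": (4051, 2462), "AN": (4083, 3084)},
}


def summarize(fn, an):
    """Return percent changes and singleton percentages for one domain."""
    fn_otus, fn_single = fn
    an_otus, an_single = an
    return {
        "change_vs_fn": 100.0 * (an_otus - fn_otus) / fn_otus,
        "change_vs_an": 100.0 * (an_otus - fn_otus) / an_otus,
        "fn_singletons": 100.0 * fn_single / fn_otus,
        "an_singletons": 100.0 * an_single / an_otus,
    }


for domain, methods in COUNTS.items():
    s = summarize(methods["FN"], methods["AN"])
    print(
        f"{domain}: change {s['change_vs_fn']:+.1f}% (vs FN), "
        f"{s['change_vs_an']:+.1f}% (vs AN); "
        f"singletons {s['fn_singletons']:.1f}% (FN) / {s['an_singletons']:.1f}% (AN)"
    )
```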

Therefore, yes, AN did “deflate” potentially “overestimated” FN OTU numbers, but not evenly: especially so for the low-diversity Archaea sample, but not at all for the highly diverse Euk sample. I assume this will also be sensitive to the overall diversity within the Domains between different samples.

The major problem I see, however, is that AN seems to have significantly increased the number and percentage of singletons among the OTUs, thereby in turn “re-inflating” the diversity/OTUs. This I don’t like very much, especially in a world where people want you to throw out all the singletons (although I’m not totally in favor of this and am not alone in the community).

I guess it comes down to the Talmudic question you asked: Which is more problematic - false-positives (seqs in an OTU if they don’t belong) or false-negatives (not including seqs in an OTU with others)?

Given that you showed that FN makes 0% false-positive errors, I think I’m still leaning towards this clustering algorithm… For us, having a seq included in an “Arctic Micromonas CCMP2099” OTU when it is in fact some other type of Micromonas seq (false-positive; which NN does a lot of and AN does some of) would be worse than having it fall out into its own second OTU (false-negative), which will get caught later on anyway when the taxonomy assignments are all counted up for drawing distribution graphs.
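
One way to put numbers on that trade-off is to count errors over pairs of sequences. The Python sketch below assumes a pairwise definition that is my own reading of the terms above: a false positive is a pair placed in the same OTU although its distance exceeds the cut-off, and a false negative is a pair within the cut-off that ended up in different OTUs. The function name pairwise_errors and the toy distances are hypothetical.

```python
from itertools import combinations


def pairwise_errors(dist, otus, cutoff):
    """Count (false_positives, false_negatives) over all sequence pairs.

    dist maps frozenset({a, b}) -> distance; otus is a list of sets of names.
    """
    otu_of = {name: i for i, members in enumerate(otus) for name in members}
    fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        same_otu = otu_of[a] == otu_of[b]
        within_cutoff = dist[frozenset((a, b))] <= cutoff
        if same_otu and not within_cutoff:
            fp += 1  # lumped together but too far apart
        elif within_cutoff and not same_otu:
            fn += 1  # close enough but split apart
    return fp, fn


# Hypothetical distances: B sits between A and C.
dist = {
    frozenset(("A", "B")): 0.010,
    frozenset(("B", "C")): 0.015,
    frozenset(("A", "C")): 0.025,
}

split = [{"A", "B"}, {"C"}]   # what a furthest-neighbor style rule would give
lumped = [{"A", "B", "C"}]    # what a more permissive rule might give

print("split :", pairwise_errors(dist, split, cutoff=0.02))
print("lumped:", pairwise_errors(dist, lumped, cutoff=0.02))
```

With these toy numbers the split clustering has no false positives but one false negative (B and C), while the lumped clustering has no false negatives but one false positive (A and C), which is the choice being weighed here.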

That’s certainly why we give people options. Since I posted that Talmudic question, we wrote the paper that will be due out in AEM in the near future…