Representative OTU seqs closer than OTU cut-off?

amcomeau · April 6, 2011, 7:08pm

Hello all,
Has anyone run into this “problem” while looking at the get.oturep output? I ran my eukaryote 18S 454 sequences through mothur using the default distance and clustering methods (furthest neighbor) at a cut-off of 0.02. After running the get.oturep command (again at the 0.02 level), I have a number of dominant OTUs whose representative sequences are nearly identical. Here are two examples copied from the output file (one contains a gap because the output comes from the aligned input files used for the dist command):

GNV8VFP05FZRD8|2601|142
CCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGTTGAGAACGGCCGGTCCGCCGTTTGGTGTGCACTGGCTGGTCTCAACTTCCTGTAGAGGACGCGCTCTGGGTTAACGCTCGGACGCGGAGTCTACGTGGTTACTTTGAAAAAATTAGAGTGTTCAAAGCGGGCTTACGCTTGAATATTTCAGCATGGAATAACACTATAGGACTCCTGTCCTATTTCGTTGGTCTCGGGACGGGAGTAATGATTAAGAGGAACAGTTGGGGCATTCGTATTTCATTGTCAGAGGTGAAATTCTTGGGATTTATGAAAGACGAACTTCTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGCTCGAAGATGATTAGATACCAT
GNV8VFP05FU55X|2597|165
CTGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGTTGAGAACGGCCGGTCCGCCGTTTGGTGTGCACTGGCTGGTCTCAACTTCCTGTAGAGGACGCGCTCTGGGTTAACGCTCGGACGCGGAGTCTACGTGGTTACTTTGAAAAAATTAGAGTGTTCAAAGCGGGCTTACGCTTGAATATTTCAGCATGGAATAACACTATAGGACTCCTGTCCTATTTCGTTGGTCTCGGGACGGGAGTAATGATTAAGAGGAACAGTTGGGGCATTCGTATTTCATTGTCAGAGGTGAAATTCTT-GGATTTATGAAAGACGAACTTCTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGCTCGAAGACGATCAGATACCGT

If you align these two sequences (to re-create what was in the output file), you’ll see that they have 4 single-base-pair differences and one gap (which counts as a single difference with the default options), so 5 differences over 432 bp of sequence. This equals a difference of 5/432 = 1.16%, well below the 2% cut-off used in the cluster step to make the OTUs.
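The arithmetic above can be checked with a small sketch. The function below is a hypothetical helper, not mothur's code: it assumes two already-aligned strings of equal length, treats `-` and `.` as gap characters, and counts each run of consecutive gaps as a single difference, which is how the post describes the default behaviour. The toy sequences are made up for illustration.

```python
def one_gap_distance(seq_a, seq_b, gap_chars="-."):
    """Fraction of differing positions, counting each gap run as one position.

    Simplified sketch: assumes both sequences are already aligned.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    compared = 0
    in_gap_run = False
    for base_a, base_b in zip(seq_a, seq_b):
        gap_a = base_a in gap_chars
        gap_b = base_b in gap_chars
        if gap_a and gap_b:
            continue  # column is a gap in both sequences: ignore it
        if gap_a or gap_b:
            if not in_gap_run:
                # First column of a gap run: one difference, one position.
                differences += 1
                compared += 1
            in_gap_run = True
            continue
        in_gap_run = False
        compared += 1
        if base_a != base_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


# Toy alignment: one two-column gap run plus one substitution.
toy = one_gap_distance("ACGTACGT", "ACG--CGA")
print(f"toy distance: {toy:.3f}")

# The figure quoted in the post: 5 differences over 432 positions.
print(f"5 differences over 432 positions: {5 / 432:.2%}")
```

Running the same function on the two aligned representative sequences should give a value that can be compared directly against the 0.02 cut-off.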

So why do these two OTUs exist separately instead of being collapsed into one OTU?

I suspect the explanation lies in the OTU clustering: it takes a sequence and grabs everything around it that is <2% distant to make a node (OTU#1), then moves to the next available sequence to make the next node (OTU#2). In the end, you may have some sequences at the fringes of the two OTUs that are closer to each other (<2%) than the “middle” of the nodes are to one another (>2%), if you see what I mean…

Can someone back me up on this or confirm? Even if this is roughly the explanation, it makes me a bit uncomfortable… hopefully I just have a bug…

pschloss · April 7, 2011, 2:28pm

I suspect the explanation lies in the OTU clustering: it takes a sequence and grabs everything around it that is <2% distant to make a node (OTU#1), then moves to the next available sequence to make the next node (OTU#2). In the end, you may have some sequences at the fringes of the two OTUs that are closer to each other (<2%) than the “middle” of the nodes are to one another (>2%), if you see what I mean…

That would be my guess. We haven’t changed the default yet, but we far prefer using average neighbor at this point because it gets around these types of problems.
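To make the fringe effect concrete, here is a minimal sketch of agglomerative clustering on three made-up sequences. It is not mothur's implementation, and the names `a`, `b`, `c` and their distances are hypothetical. Furthest neighbor uses the largest pairwise distance between two OTUs as the OTU-to-OTU distance; average neighbor uses the mean.

```python
from itertools import combinations


def cluster(names, dist, cutoff, linkage):
    """Merge the closest pair of OTUs until the closest remaining pair
    is farther apart than the cutoff. `linkage` turns the list of
    pairwise distances between two OTUs into one OTU-to-OTU distance."""
    otus = [frozenset([name]) for name in names]

    def otu_distance(x, y):
        return linkage([dist[frozenset((a, b))] for a in x for b in y])

    while len(otus) > 1:
        d, x, y = min(
            ((otu_distance(x, y), x, y) for x, y in combinations(otus, 2)),
            key=lambda item: item[0],
        )
        if d > cutoff:
            break
        otus = [o for o in otus if o not in (x, y)] + [x | y]
    return sorted(sorted(o) for o in otus)


def furthest(values):
    return max(values)


def average(values):
    return sum(values) / len(values)


# Hypothetical distances: a-b and b-c are under 2%, a-c is over 2%.
dist = {
    frozenset("ab"): 0.012,
    frozenset("bc"): 0.010,
    frozenset("ac"): 0.022,
}

print("furthest neighbor:", cluster("abc", dist, 0.02, furthest))
print("average neighbor: ", cluster("abc", dist, 0.02, average))
```

With furthest neighbor, `b` and `c` merge first, and `a` is then kept out because it is 2.2% from `c`, even though `a` and `b` are only 1.2% apart. With average neighbor, the mean distance from `a` to the merged OTU is 1.7%, so all three end up together.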

amcomeau · April 12, 2011, 5:29pm

OK Pat,
I read through your Talmudic question post and the Huse paper discussing Average Neighbor (AN). Since this question is still bugging me, I did a little test with thousands of 454 seqs from Bacteria, Archaea and Eukarya that I had previously analyzed with the default Furthest Neighbor (FN), and reran them with AN:

Furthest Neighbor / Average Neighbor OTUs (change relative to the AN count):
Bacteria = 907 / 817 (-11%)
Archaea = 901 / 641 (-41%)
Eukarya = 4051 / 4083 (+1%)

Furthest Neighbor / Average Neighbor Singletons:
Bacteria = 490 / 567
Archaea = 205 / 330
Eukarya = 2462 / 3084

Furthest Neighbor / Average Neighbor % Singletons:
Bacteria = 54% / 69%
Archaea = 23% / 52%
Eukarya = 61% / 76%

Therefore, yes, AN did “deflate” potentially “overestimated” FN OTU numbers, but not evenly: especially so for the low-diversity Archaea sample, but not at all for the highly diverse Euk sample. I assume this will also be sensitive to the overall diversity within the Domains between different samples.

The major problem I see, however, is that AN seems to have significantly increased both the number and the percentage of singleton OTUs, thereby in turn “re-inflating” the diversity/OTUs. This I don’t like very much, especially in a world where people want you to throw out all the singletons (although I’m not totally in favor of this and am not alone in the community).
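For anyone who wants to recheck the tables, the percentages can be recomputed from the posted counts with a few lines. This is only a sketch; the dictionary layout and the `percent` helper are my own. The posted OTU changes match a calculation that divides by the AN count, and a recomputed value can differ from a posted one by a point because of rounding.

```python
# (OTUs, singletons) for each method, copied from the tables above.
counts = {
    "Bacteria": {"FN": (907, 490), "AN": (817, 567)},
    "Archaea": {"FN": (901, 205), "AN": (641, 330)},
    "Eukarya": {"FN": (4051, 2462), "AN": (4083, 3084)},
}


def percent(part, whole):
    return 100 * part / whole


for domain, methods in counts.items():
    fn_otus, fn_singletons = methods["FN"]
    an_otus, an_singletons = methods["AN"]
    change = an_otus - fn_otus
    print(
        f"{domain}: "
        f"OTU change {percent(change, an_otus):+.1f}% of AN "
        f"({percent(change, fn_otus):+.1f}% of FN), "
        f"singletons {percent(fn_singletons, fn_otus):.1f}% FN / "
        f"{percent(an_singletons, an_otus):.1f}% AN"
    )
```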

I guess it comes down to the Talmudic question you asked: Which is more problematic - false-positives (seqs in an OTU if they don’t belong) or false-negatives (not including seqs in an OTU with others)?

Given that you showed that FN makes 0% false-positive errors, I think I’m still leaning towards this clustering algorithm… For us, having a seq included in an “Arctic Micromonas CCMP2099” OTU when it is in fact some other type of Micromonas seq (a false-positive, which NN does a lot of and AN does some of) would be worse than having it fall out into its own second OTU (a false-negative), which will get caught later on anyway when the taxonomy assignments are all counted up for drawing distribution graphs.
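The trade-off can be counted on a toy example. The sketch below assumes a pairwise reading of the two error types, which is my own formalisation of the definitions above: a false-positive is a pair placed in the same OTU although its distance exceeds the cut-off, and a false-negative is a pair split across OTUs although its distance is within the cut-off. The sequences `a`, `b`, `c` and their distances are hypothetical.

```python
from itertools import combinations


def pair_errors(otus, dist, cutoff):
    """Return (false_positives, false_negatives) counted over all pairs."""
    otu_of = {name: index for index, otu in enumerate(otus) for name in otu}
    false_pos = 0
    false_neg = 0
    for a, b in combinations(sorted(otu_of), 2):
        together = otu_of[a] == otu_of[b]
        close = dist[frozenset((a, b))] <= cutoff
        if together and not close:
            false_pos += 1
        elif close and not together:
            false_neg += 1
    return false_pos, false_neg


# Hypothetical distances: a-b and b-c are under 2%, a-c is over 2%.
dist = {
    frozenset("ab"): 0.012,
    frozenset("bc"): 0.010,
    frozenset("ac"): 0.022,
}

# Two ways of grouping the same three sequences at a 0.02 cut-off.
split_grouping = [["a"], ["b", "c"]]   # a kept in its own OTU
merged_grouping = [["a", "b", "c"]]    # everything in one OTU

print("split: ", pair_errors(split_grouping, dist, 0.02))
print("merged:", pair_errors(merged_grouping, dist, 0.02))
```

The split grouping has no false-positives but one false-negative (`a` and `b`), while the merged grouping has one false-positive (`a` and `c`) and no false-negatives, which is the choice being weighed here.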

pschloss · April 13, 2011, 4:26pm

That’s certainly why we give people options. Since I posted that Talmudic question, we wrote the paper that will be due out in AEM in the near future…
