OTUs & Ecological Consistency

Dear mothur community,

I’d like to shamelessly advertise a paper we just published that may be interesting to some of you (for the others, sorry for the SPAM :wink: ):

Schmidt TSB, Matias Rodrigues JF, von Mering C (2014) Ecological Consistency of SSU rRNA-Based Operational Taxonomic Units at a Global Scale. PLOS Computational Biology 10: e1003594. doi:10.1371/journal.pcbi.1003594.

We investigated whether OTUs are “ecologically meaningful”, in the sense that they cluster sequences of similar ecological affiliation. In their 2013 paper, Koeppel & Wu had doubted that OTUs make sense ecologically:

Koeppel AF, Wu M (2013) Surprisingly extensive mixed phylogenetic and ecological signals among bacterial Operational Taxonomic Units. Nucleic Acids Research 41: 5175–5188. doi:10.1093/nar/gkt241.

They used very fine-scale ecological descriptions and very small datasets, and found that their proposed algorithm to simulate “ecotypes” performed better than traditional OTU clustering. However, ecotype simulation is highly parametric, it assumes that the Stable Ecotype model of bacterial speciation holds for all microbial taxa, and the software has throughput problems even with medium-sized sequence datasets. We took a different approach, assessing OTU ecological consistency on a global, highly diverse dataset and for ecological descriptions at different resolutions. We found that OTUs were generally, but not perfectly, consistent. In other words, they’re probably “good enough” for most practical purposes.
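
To make the notion of "consistency" a bit more concrete, here is a minimal Python sketch. It is not the statistic used in the paper; it assumes a simple stand-in (size-weighted majority-label purity), and the function name `otu_purity`, the sequence IDs and the habitat labels are all made up for illustration.

```python
from collections import Counter


def otu_purity(otus, habitat_of):
    """Fraction of sequences that carry the majority habitat label of their OTU.

    `otus` is a list of OTUs, each a list of sequence IDs.
    `habitat_of` maps a sequence ID to one habitat label.
    NOTE: a hypothetical stand-in measure, not the paper's statistic.
    """
    total = 0
    in_majority = 0
    for members in otus:
        counts = Counter(habitat_of[seq_id] for seq_id in members)
        # Size of the largest habitat group inside this OTU.
        in_majority += counts.most_common(1)[0][1]
        total += len(members)
    if total == 0:
        raise ValueError("no sequences to score")
    return in_majority / total


# Toy annotations (invented labels, two sequences per habitat).
toy_habitats = {
    "s1": "soil", "s2": "soil",
    "s3": "marine", "s4": "marine",
    "s5": "gut", "s6": "gut",
}

# Two alternative partitions of the same six sequences.
partition_a = [["s1", "s2"], ["s3", "s4"], ["s5", "s6"]]
partition_b = [["s1", "s2", "s3"], ["s4", "s5", "s6"]]

for name, partition in [("A", partition_a), ("B", partition_b)]:
    print(name, otu_purity(partition, toy_habitats))
```

A caveat on this simplified measure: it rewards small OTUs (every singleton is trivially "pure"), so any real comparison would need a baseline, for example the same score computed on shuffled habitat labels.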

Moreover, we compared ecological consistency for several different methods (average, complete and single linkage hierarchical clustering, plus cd-hit and uclust). The ecological signals we used can be interpreted as a (sequence- and taxonomy-independent, external) measure of clustering “quality”. It’s hard to define any meaningful benchmark for the “goodness” of OTUs - Patrick Schloss has done some important work in that direction in the past. We think that “ecological consistency” is a useful way of looking at “OTU quality” that can complement such earlier studies.
Somewhat to our surprise, we found that complete linkage (rather than average linkage) provided the most “ecologically consistent” OTUs. We are currently following up on this, and in several other tests we have seen that average linkage may instead perform best. For the dataset in the paper, though, our study corroborates the idea that (hierarchical) complete linkage clustering is a good choice if you want “ecologically consistent” OTUs.
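
For anyone who wants to see how the three hierarchical linkage rules differ in practice, here is a small, naive Python sketch. It is only an illustration under stated assumptions: the distance is a plain mismatch fraction on pre-aligned toy sequences, the names `mismatch_fraction` and `cluster_otus` are made up, and this is neither mothur's implementation nor the pipeline from the paper.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# How the distance between two clusters is derived from pairwise distances.
LINKAGE = {
    "single": min,                            # closest pair of members
    "complete": max,                          # most distant pair of members
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}


def cluster_otus(seqs, cutoff, method="average"):
    """Naive agglomerative clustering; returns OTUs as sorted lists of indices.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is further apart than `cutoff`.
    """
    link = LINKAGE[method]
    dist = {
        (i, j): mismatch_fraction(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
    clusters = [[i] for i in range(len(seqs))]

    def cluster_distance(c1, c2):
        return link([dist[min(i, j), max(i, j)] for i in c1 for j in c2])

    while len(clusters) > 1:
        d, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy "alignment": a chain of sequences that drift away from the first one.
toy_seqs = ["A" * 20, "C" + "A" * 19, "CCC" + "A" * 17, "CCCCC" + "A" * 15]

for method in ("single", "average", "complete"):
    print(method, cluster_otus(toy_seqs, cutoff=0.18, method=method))
```

On chained toy data like this, single linkage tends to merge everything along the chain, while complete linkage only merges clusters whose most distant members are still within the cutoff, so it produces tighter OTUs. The pairwise search is recomputed at every step, which is fine for a handful of sequences but far too slow for real datasets.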

As said above, this is on the one hand to shamelessly advertise the paper ;). On the other hand, I thought I would put it out here, as this seems to be a great place to get some (critical) feedback on the work and some discussion with people who demarcate OTUs from real-life datasets every day :slight_smile:



Hi Sebastian,

Glad to see you and others in the microbial community involved in solving this and many other ecological questions.

You briefly acknowledged the importance of testing this idea on data generated from partial 16S sequences (regions V4, V1-V3, etc.). As you may be aware, the majority of the work we (i.e. the microbial ecology community) have produced and published has generated this type of data (for now).

Did you perform any preliminary analysis, or do you have an idea whether the results (the use of complete linkage) will be consistent with those you obtained with full-length sequences?


Hi Vicente,

thanks for your feedback! :slight_smile:

The short answer: we didn’t run the same tests on shortread data, but we did some other tests.

We used a dataset of full-length sequences for several reasons:

(i) Every 16S subregion, or set of subregions, behaves (slightly) differently in OTU demarcation. This has been shown quite nicely e.g. by Patrick Schloss (PLOS Comp Biol, 2010) and also by Kim et al (Journal of Microbiol Methods, 2011). We wanted to design a broadly applicable test set whose results would not hold only for a subset of targeted subregions.

(ii) Depending on sequencing platform and pre-filtering, published shortread datasets can have highly divergent sequence length and sequence quality. We restricted our study to (near) full-length sequences and applied rather strict quality criteria, in order to obtain a consistently high-quality test set.

(iii) In scope and pre-processing, the dataset we used closely resembles the reference sets provided e.g. by RDP, Greengenes and SILVA. These are used in many different contexts, including ‘reference-based OTU picking’.

(iv) We wanted to use a dataset of very broad ecological scope. We could have composed our own test dataset from available shortread datasets, but that would have been (even more) biased towards individual environments. Instead, we grabbed all data for (high-quality, full-length) sequences available via GenBank / RefSeq.

(v) I’m no expert on sequencing technology, but from what I hear the next generation of platforms (or is it next-next-gen in the meantime, or next-next-next-gen?!) will achieve read lengths of ≥1,000 bp. Hopefully, full-length 16S sequencing at very high throughput will soon be possible.

Nevertheless, we repeated an analysis very similar to the one detailed in the paper on shortread sequences (V23, V35 and V6) extracted from the full-length dataset. Results were consistent.

Moreover, I have in the meantime looked into a couple of other factors regarding shortread clustering. The results are unpublished, but they point to hierarchical complete and average linkage clustering as being very reasonable choices.