OTU classification and minimum entropy decomposition

I’m a new user of mothur and I’ve been reading about the methodology of OTU generation. Last year a paper came out by Eren et al. that generates OTUs using minimum entropy decomposition (MED), where information-rich base positions are used to iteratively split a group of sequences into smaller groups, ultimately ending with small groups that serve as the final OTUs. The conventional approach implemented in mothur first bins sequences at a taxonomic level (e.g. Order), then does de novo clustering within each bin. (Please let me know if any of this is inaccurate.)

2 questions come to mind:

  1. Is MED potentially a “better” approach than the current one? Or is MED good only in the sense that it can tease apart very similar sequences into separate OTUs?

  2. How do you deal with multiple copies of 16S genes in one organism? I think I read once that the number of 16S gene copies within one species can range up to the tens, pseudogenes included. Would this result in inaccuracy when inferring organismal abundance from 16S reads?

Ref:
Eren et al., ISME J 2015: http://www.nature.com/ismej/journal/v9/n4/full/ismej2014195a.html

I wouldn’t comment on whether it’s ‘better’ or not; it’s more just a different way to analyse the data. If I remember the manuscript correctly, the authors pull out a single genus or sequence cluster to analyse - they don’t use oligotyping across the full data set. The approach is designed to look for very subtle differentiation within a mostly identical set of sequences.
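To make the entropy idea a bit more concrete, here is a rough sketch in Python of a single split step: compute the Shannon entropy of each column in a set of aligned reads, then partition the reads on the most variable column. This is only an illustration of the principle, not the MED or oligotyping implementation; the function names and the toy reads are made up, and the published method applies additional criteria and noise filtering that are left out here.

```python
import math
from collections import Counter

def column_entropies(seqs):
    """Shannon entropy (in bits) at each position of equal-length aligned reads."""
    entropies = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

def split_on_max_entropy(seqs):
    """Partition reads by the base found at the highest-entropy position."""
    ent = column_entropies(seqs)
    pos = max(range(len(ent)), key=ent.__getitem__)
    groups = {}
    for s in seqs:
        groups.setdefault(s[pos], []).append(s)
    return pos, groups

# Hypothetical toy reads that differ only at the third position (index 2).
reads = ["ACGTA", "ACGTA", "ACCTA", "ACCTA"]
pos, groups = split_on_max_entropy(reads)
print(pos, groups)
```

In this toy case the only informative column is index 2, so the reads fall into a "G" group and a "C" group. Repeating the step on each resulting group until no column is informative gives the iterative decomposition described in the question.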

I would say that it’s inappropriate to apply oligotyping to your full data; rather, it’s more of a downstream extension of your analysis. For example, you build up your OTU table and look at all the differences, then if you notice a particular OTU or genus is doing interesting things you could examine it further with oligotyping.

For your second question - yes, it would. But if you’re looking at 16S data alone there’s nothing you can do about it. It’s just one of those limitations of the method. You can partly get around this (at least the straight copy number difference) by using presence/absence methods of comparison.
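As a small illustration of why presence/absence comparisons sidestep the copy number issue, here is a sketch of a Jaccard distance computed on OTU membership only. The OTU names and counts are invented for the example, and the function is a plain Python stand-in rather than anything taken from mothur.

```python
def jaccard_distance(sample_a, sample_b):
    """Jaccard distance between two samples using OTU presence/absence only."""
    present_a = {otu for otu, n in sample_a.items() if n > 0}
    present_b = {otu for otu, n in sample_b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(present_a & present_b) / len(union)

# Hypothetical read counts; OTU1 stands in for an organism with many 16S copies.
site1 = {"OTU1": 10, "OTU2": 5, "OTU3": 0}
site2 = {"OTU1": 70, "OTU2": 5, "OTU3": 2}

d = jaccard_distance(site1, site2)

# Inflating OTU1's counts (as extra gene copies would) leaves the distance unchanged.
site2_inflated = dict(site2, OTU1=site2["OTU1"] * 7)
d_inflated = jaccard_distance(site1, site2_inflated)
print(d, d_inflated)
```

Because only membership is counted, multiplying an OTU's read count by its copy number has no effect on the result. The trade-off is that all abundance information is discarded, so this addresses the copy number difference but not other sources of bias.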