consensus.seqs: lowercase, vs get.oturep, and groups

Hi all,

In the interest of making the best use of unifrac.weighted on OTU clusters, I have been investigating the effect of building trees (using either the clearcut wrapper in mothur or the fasttree program) from either representative sequences for each OTU (get.oturep) or consensus sequences (consensus.seqs). I have a few technical and conceptual questions about this and I would appreciate forum and/or schlossian input. consensus.seqs is much faster and doesn't require a distance matrix, which makes it more attractive computationally, but get.oturep produces a very satisfying subset of actual sequences, which is intuitively cleaner to work with downstream…

  1. consensus.seqs sometimes uses lowercase letters in the sequence, particularly for variable bases. Why?

  2. get.oturep allows incorporation of a group file to ensure that a representative sequence is selected for each sample. Is this reasonable for consensus.seqs, and are there theoretical arguments for or against having sample-specific consensus/representative sequences for OTUs?

  3. Is there any conceptual/theoretical argument for using representative sequences vs. consensus sequences to build phylogenies of OTUs? I prefer to run Unifrac on weighted OTUs rather than weighted unique sequences because it makes the trees smaller, the nodes more accurate, and removes some of the small-scale variability, which is more or less meaningless at the community level for my questions. I know that fasttree ignores all but ATCG; does anyone know if clearcut also ignores ambiguous bases? Are there tree-building programs which explicitly incorporate nucleotide ambiguity codes into evolutionary models?
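
To make questions 1 and 3 a bit more concrete, here is a minimal Python sketch for checking how many lowercase and ambiguous characters a set of consensus sequences actually contains before it goes into a tree builder. It is only an illustration and not part of mothur: the helper names `read_fasta` and `base_summary` are hypothetical, and it assumes FASTA input in which `-` and `.` are gap characters and anything other than A, C, G, T or U counts as an ambiguity code.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the full header (minus '>') as the sequence name.
            name = line[1:].strip()
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def base_summary(seq):
    """Count gaps, lowercase letters, and non-ACGTU (ambiguous) characters."""
    counts = Counter()
    for ch in seq:
        if ch in "-.":  # assumption: both characters denote alignment gaps
            counts["gap"] += 1
            continue
        if ch.islower():
            counts["lowercase"] += 1
        if ch.upper() not in "ACGTU":
            counts["ambiguous"] += 1
    return counts


if __name__ == "__main__":
    example = ">Otu1\nACGTacgtRYN--.\n"
    for name, seq in read_fasta(example).items():
        print(name, dict(base_summary(seq)))
```

Running the same summary on the get.oturep output would show how much of the difference between the two inputs comes down to characters that a tree builder may treat differently from plain A, C, G and T.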

By the way, unifrac.weighted produces highly correlated distance matrices (r > 0.85) using either NJ or ML trees and either rep or consensus sequences…but I’d like to have a stronger basis for these conclusions than the standard microbial ecology line of “it doesn’t really matter either way.” :)
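
For anyone who wants to run a similar comparison, below is a minimal sketch of one way to correlate two distance matrices: a Pearson correlation over the lower-triangle entries. It is an illustration in Python rather than a record of how the r values above were computed, and the function names (`lower_triangle`, `pearson`, `matrix_correlation`) are hypothetical. It assumes both matrices are square, symmetric, and list the samples in the same order.

```python
import math


def lower_triangle(matrix):
    """Return the below-diagonal entries of a square matrix as a flat list."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def matrix_correlation(d1, d2):
    """Correlate two distance matrices using their lower triangles."""
    return pearson(lower_triangle(d1), lower_triangle(d2))


if __name__ == "__main__":
    # Toy matrices: d2 is d1 scaled by 2, so the two are perfectly correlated.
    d1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    d2 = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
    print(matrix_correlation(d1, d2))
```

Because the pairwise distances in a matrix are not independent of one another, the significance of such an r is usually assessed with a permutation-based Mantel test rather than an ordinary correlation p-value.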