Hi all,
In the interest of making the best use of unifrac.weighted on OTU clusters, I have been investigating the effect of building trees (with either the clearcut wrapper in mothur or the fasttree program) from either representative sequences for each OTU (get.oturep) or consensus sequences (consensus.seqs). I have a few technical and conceptual questions about this and would appreciate forum and/or schlossian input. consensus.seqs is much faster and doesn’t require a distance matrix, making it more desirable computationally, but get.oturep produces a very satisfying subset of actual sequences, which is intuitively cleaner to work with downstream…
- consensus.seqs sometimes uses lowercase letters in the sequence, particularly for variable bases. Why?
- get.oturep allows incorporation of a group file to ensure that a representative sequence is selected for each sample. Is this reasonable for consensus.seqs, and are there theoretical arguments for or against having sample-specific consensus/representative sequences for OTUs?
- Is there any conceptual/theoretical argument for using representative sequences vs. consensus sequences to build phylogenies of OTUs? I prefer to run Unifrac on weighted OTUs rather than weighted unique sequences because it makes the trees smaller, makes the nodes more accurate, and removes some of the small-scale variability, which is more or less meaningless at the community level for my questions. I know that fasttree ignores all but ATCG; does anyone know if clearcut also ignores ambiguous bases? Are there tree-building programs which explicitly incorporate nucleotide ambiguity codes into evolutionary models?
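For anyone who wants to check how much this matters in their own data, here is a minimal sketch (plain Python, standard library only) that tallies uppercase A/C/G/T, lowercase a/c/g/t, gap characters, and everything else in each sequence of a FASTA file. The function names and the toy record are made up for illustration, and lumping every non-ACGT letter together as "ambiguous" is an assumption of the sketch, not a statement about how consensus.seqs, fasttree, or clearcut treat those characters.

```python
from collections import Counter


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def base_summary(seq):
    """Count upper/lowercase ACGT, gap characters, and all other symbols."""
    counts = Counter()
    for ch in seq:
        if ch in "-.":
            counts["gap"] += 1
        elif ch.upper() in "ACGT":
            counts["lower_acgt" if ch.islower() else "upper_acgt"] += 1
        else:
            # Assumption: anything else (N, R, Y, ...) is an ambiguity code.
            counts["ambiguous"] += 1
    return counts


# Hypothetical toy record, not real consensus.seqs output.
demo = ">Otu001\nACGTacgt\nNRY-.\n"
for header, seq in parse_fasta(demo):
    print(header, dict(base_summary(seq)))
```

Running this on the consensus fasta and on the get.oturep fasta would show how many positions in each set fall outside plain ACGT before the trees are built.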
By the way, unifrac.weighted produces highly correlated distance matrices (r > 0.85) whether I use NJ or ML trees and whether I use rep or consensus sequences… but I’d like to have a stronger basis for these conclusions than the standard microbial ecology line of “it doesn’t really matter either way.”
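For reference, a comparison like the one above can be sketched as a Pearson correlation over the off-diagonal entries of two square distance matrices. This is only an illustration: the helper names and the toy matrices are hypothetical, and the sketch assumes both matrices list the samples in the same order. Because pairwise distances are not independent, a permutation-based approach such as a Mantel test would be needed to attach a p-value to r.

```python
import math


def lower_triangle(matrix):
    """Return the entries below the diagonal of a square matrix, row by row."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy matrices for three samples (not real UniFrac output).
dist_a = [[0.0, 0.2, 0.5],
          [0.2, 0.0, 0.4],
          [0.5, 0.4, 0.0]]
dist_b = [[0.0, 0.3, 0.6],
          [0.3, 0.0, 0.4],
          [0.6, 0.4, 0.0]]

r = pearson_r(lower_triangle(dist_a), lower_triangle(dist_b))
print(f"r = {r:.3f}")
```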