Sequence trimming

To create the distance matrix needed for downstream analyses, I trimmed the multiple sequence alignment the same way I do for phylogenetic analyses (removing every column that contains an N or a gap of any length). I noticed, however, that if I don’t remove the gaps, I get a different rarefaction curve.

Does anyone know the correct way to handle gap removal?

Well, the way we suggest doing it for OTU assignment is to run filter.seqs so that the sequences are trimmed to overlap over the same region (trump=.) and the columns that contain only gaps are filtered out (vertical=T). What you’re doing will remove a lot of data from the more variable regions and will probably suppress the overall genetic diversity between sequences, giving you fewer OTUs. While that works to give you a broad-level understanding of the community, I would avoid that approach for assigning sequences to OTUs. You’ll also notice that “some people” apply the Lane mask to generate trees prior to running UniFrac. I think this is also problematic because you’re removing much of the genetic diversity.
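To make the difference concrete, here is a rough Python sketch of the two filtering strategies on a toy alignment. This is not mothur's code; the function names (filter_like_trump_vertical, strip_any_gap_or_n) and the three sequences are made up for illustration, and it assumes '.' marks positions outside a sequence's range while '-' marks a gap inside the alignment.

```python
def filter_like_trump_vertical(seqs, trump=".", vertical=True):
    """Toy column filter over equal-length aligned sequences (illustrative only)."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        column = [s[col] for s in seqs]
        # trump: drop the column if ANY sequence has the trump character there
        if trump is not None and trump in column:
            continue
        # vertical: drop the column if EVERY sequence has a gap character there
        if vertical and all(c in "-." for c in column):
            continue
        keep.append(col)
    return ["".join(s[c] for c in keep) for s in seqs]


def strip_any_gap_or_n(seqs):
    """Phylogeny-style trimming: drop every column containing a gap or an N."""
    length = len(seqs[0])
    keep = [
        col for col in range(length)
        if not any(s[col] in "-.Nn" for s in seqs)
    ]
    return ["".join(s[c] for c in keep) for s in seqs]


# Hypothetical toy alignment, not real data
seqs = [
    "..ACG-T-AC",
    "A-ACGTT-AC",
    "AGAC--T-..",
]

overlap_filtered = filter_like_trump_vertical(seqs)
strictly_trimmed = strip_any_gap_or_n(seqs)

print(overlap_filtered)  # sequences still differ at the gapped positions
print(strictly_trimmed)  # all three sequences become identical
```

In this toy case the overlap-style filter keeps the columns where the sequences differ, while removing every gapped column leaves three identical sequences, which is the loss of diversity described above.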

Great question,
Pat