Selecting sequences for unifrac analysis.


I’m working on some 16S sequence data obtained from several sources, and I’m using mothur for my statistical analysis. I’ve worked through the esophageal community analysis example and have everything working fine.

I have a question about the unifrac command. Some of the sequence data I’m using came from an earlier study and contains significantly fewer sequences. I’m aware that unifrac can be influenced by differences in the number of sequences per group, so I was wondering about the best way to select a subset of the larger sequence pool to match this smaller set.

When I compare two of the groups in my data with an unweighted unifrac I get a p-value of 0.04, which is just a little bit too high to accept a significant difference (I would need a p-value of less than 0.0167), but since it’s coming quite close to the threshold I want to be sure that the groups aren’t being pushed apart by a bias in the methodology.

As I currently see it, my options are either to manually select what I consider to be a representative sample of each pool, so that I end up with the same number of sequences in each group, or to randomly take a subset of each group (and probably repeat the process a few times to avoid any skew). To take that even further, I could make multiple subsets of the larger sequence pool and then compare them to test for significant differences caused by the selection process.
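For illustration, here is a minimal Python sketch of the random-subset option. It is not mothur code: the function name `subsample_groups`, the input dictionary, and the group labels are all hypothetical, and it assumes the sequence-to-group assignments have already been read into memory (for example from a two-column group file).

```python
import random
from collections import defaultdict


def subsample_groups(seq_to_group, size=None, seed=None):
    """Randomly draw the same number of sequence names from every group.

    seq_to_group: dict mapping sequence name -> group label (hypothetical input).
    size: number of sequences to keep per group; defaults to the size of the
          smallest group.
    seed: optional seed so that a given draw can be reproduced.
    """
    rng = random.Random(seed)

    # Collect the sequence names belonging to each group.
    by_group = defaultdict(list)
    for name, group in seq_to_group.items():
        by_group[group].append(name)

    if size is None:
        size = min(len(names) for names in by_group.values())

    picked = {}
    for group, names in by_group.items():
        if len(names) < size:
            raise ValueError(f"group {group!r} has only {len(names)} sequences")
        # Sort first so the draw depends only on the seed, not on input order.
        picked[group] = rng.sample(sorted(names), size)
    return picked


# Made-up example: a small pool from an earlier study and a larger recent one.
example = {f"old_{i}": "old_study" for i in range(50)}
example.update({f"new_{i}": "new_study" for i in range(400)})

# Repeat the draw several times, as suggested above, to check that the
# downstream result does not depend on one particular random subset.
replicates = [subsample_groups(example, seed=s) for s in range(5)]
for rep in replicates:
    print({group: len(names) for group, names in rep.items()})
```

Each replicate could then be used to filter the sequence files before rerunning the comparison. If the conclusion changes from one replicate to the next, that suggests the result is sensitive to which sequences were selected.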

Any advice on or improvements to what I’ve suggested would be great, thanks.

You might try the sub.sample command, which is documented on the mothur wiki. It lets you specify the number of sequences to keep in each group, and you can continue your analysis from there.