dist file difficulties

Hi all
I know this topic has been covered extensively in this forum; nevertheless, I’m still having difficulties.
I’m trying to do a downstream analysis of 20 pyrosequencing libraries and to use one of the hypothesis-driven OTU analyses: unifrac or libshuff. The problem is that the column distance file doesn’t work with the libshuff command (the name file, actually) or with neighbor.exe (from the PHYLIP package), and the square dist file is too big (> 5 GB). The only way I could think of is to restart the data analysis without creating a name file (no unique.seqs or pre.cluster) for use with unifrac, but I still can’t create a tree using neighbor.exe from the PHYLIP package.
Is there an easier way to use read.dist for libshuff (including the name file)? And is it possible to create a tree using a column distance file?

So here’s two answers…

  1. The hypothesis testing approaches are generally worthless, so don’t bother. You will probably have so much statistical power from 454 that the libraries will be statistically different. Remember, all you get from these tests is a p-value, which is akin to a yes-or-no answer. Not incredibly informative.

  2. Assuming you don’t appreciate the cynicism or your advisor is breathing down your neck :slight_smile: you have a few options. First, do not use neighbor.exe; it is extremely slow and memory-inefficient. Something like FastTree or clearcut (we have a wrapper for clearcut in mothur) would be better. As for the size of the distance matrix, I’m afraid you may be stuck. Make sure that you are simplifying the dataset as much as possible with judicious use of unique.seqs, filter.seqs, pre.cluster, and chimera.slayer. The column format will not gain you anything, because you need all of the distances, and if you don’t use a cutoff the column matrix ends up being larger than a phylip-formatted matrix. One option would be to do an OTU-based approach, identify representative sequences, and then build trees from those to run through unifrac. I have some misgivings about that approach, but Knight et al. seem to favor it.
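
To make the point about file sizes concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of mothur. The function names and the field widths (a 14-character sequence name, a 6-character distance, one separator each) are assumptions for illustration only, so treat the numbers as order-of-magnitude estimates. It compares a square matrix, a lower-triangle matrix, and a column-format file with no cutoff, where every pair gets its own line carrying both sequence names.

```python
def n_pairs(n_seqs):
    """Number of distinct pairwise distances among n_seqs sequences."""
    return n_seqs * (n_seqs - 1) // 2


def estimate_bytes(n_seqs, name_len=14, dist_len=6):
    """Rough size in bytes of three distance-file layouts.

    name_len and dist_len are assumed field widths, not values taken
    from any real file; each field is followed by one separator byte.
    """
    pairs = n_pairs(n_seqs)
    entry = dist_len + 1            # one distance plus a separator
    names = n_seqs * (name_len + 1)  # one name per row in matrix layouts

    square = names + n_seqs * n_seqs * entry
    lower_triangle = names + pairs * entry
    # Column format with no cutoff: every pair is written on its own
    # line together with both sequence names.
    column = pairs * (2 * (name_len + 1) + entry)
    return {"square": square, "lower_triangle": lower_triangle, "column": column}


if __name__ == "__main__":
    # Hypothetical numbers of unique sequences, for illustration only.
    for n in (10_000, 30_000):
        sizes = estimate_bytes(n)
        print(n, {k: round(v / 1e9, 2) for k, v in sizes.items()})
```

Under these assumptions the column file is the largest of the three, because the two names repeated on every line outweigh the saving from storing each pair only once. Either way, reducing the number of unique sequences is what actually shrinks the file, since every layout grows with the square of that number.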

Hope this helps.

Thanks a lot