Newick tree clustering

Hi Forum,
I have yet another question…

Following the Schloss SOP, I’ve generated a tree using the tree.shared command. I did this on a sub-sample of my data containing only the OTUs that have a prevalence of >5% in at least one sample, which leaves 27 OTUs. Two of the OTUs dominate and segregate, so that most samples have 70-95% of one or the other. What puzzles me is that when I build the tree, some samples with different dominant OTUs come out closer to each other than samples with the same dominant OTU. There is no obvious pattern in the other 25 OTUs that explains this clustering. This seems very counterintuitive to me, as I would expect two main clusters based on the dominant OTUs and then sub-clustering based on the other OTUs (this is also what I get in a PCA in the JMP statistics software).

I’d be happy to provide the files if anyone is willing to have a look or have any suggestions for what may cause this.

Thanks in advance,

What distance measure are you using for the two analyses? You can look for different types of community structure/similarity depending on which distance measure you choose. For example, the cluster dendrograms may look very different if one measure is an abundance metric and the other is presence/absence. Neither is right or wrong; you just need to know what your question is and which metric is most appropriate for that question and your data.

Thanks for your reply, but I don’t quite follow you, though that may be because I don’t really understand the command!
The .shared file I’m using for tree.shared looks like this:

label Group numOtus Otu001 Otu002 Otu003 Otu004 Otu005 Otu007 Otu009 …
0.03 Ae.505C.v341F 27 97 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.03 Ae.322C.v341F 27 3 0 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.03 Ae.528C.v341F 27 98 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

So for each OTU I have the abundance (in percent)

OTUs 1 and 3 are dominating the samples. In the final tree with 59 samples, however, Ae.505 comes out right next to Ae.322 and far from Ae.528 (the majority of samples, but not all, show the same pattern with either OTU 1 or 3 dominating).

I’ve looked at both the Jaccard and Yue & Clayton theta trees in TreeView.

Is my .shared file not appropriate for this analysis?

This is a weird way of doing what you want. Why not just use a normal shared file? Anyway, I entered what you gave us, but trimmed everything to 5 OTUs. Here’s what the output of dist.shared looks like for Jaccard…

Ae.322C.v341F 0.000000 
Ae.528C.v341F 0.500000 0.500000

And for ThetaYC…

Ae.322C.v341F 0.978450 
Ae.528C.v341F 0.000210 0.983800

Those values look correct. Looking at the trees… For Jaccard:


For ThetaYC:


As you’ll see, with Jaccard 505 and 322 cluster together, and with ThetaYC 505 and 528 cluster together. This makes sense looking at the data: 505 and 322 contain the same two OTUs, while 505 and 528 are both dominated by Otu001. As kmitchell correctly pointed out, Jaccard is based on presence/absence and ThetaYC is based on relative abundance. Frankly, I don’t put much stock in presence/absence based metrics because rare bugs are hard to find and these metrics are sensitive to undersampling. But you have to keep in mind that these two metrics measure two different things: membership (Jaccard) vs. structure (ThetaYC).
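
If it helps to see the arithmetic, here is a rough Python sketch (not mothur code) that recomputes both distances for the three samples posted above, trimmed to the first 5 OTU columns. It assumes "Jaccard" means the classic presence/absence version and that ThetaYC is calculated on relative abundances; the function and variable names are made up for illustration.

```python
from itertools import combinations

# Abundances from the thread, trimmed to the first 5 OTU columns.
samples = {
    "Ae.505C.v341F": [97, 0, 1, 0, 0],
    "Ae.322C.v341F": [3, 0, 94, 0, 0],
    "Ae.528C.v341F": [98, 0, 0, 0, 0],
}

def jaccard_distance(x, y):
    """1 - (shared OTUs / OTUs present in either sample). Abundance is ignored."""
    present_x = {i for i, v in enumerate(x) if v > 0}
    present_y = {i for i, v in enumerate(y) if v > 0}
    union = present_x | present_y
    if not union:
        return 0.0
    return 1.0 - len(present_x & present_y) / len(union)

def thetayc_distance(x, y):
    """1 - Yue & Clayton theta, using relative abundances (assumed form)."""
    a = [v / sum(x) for v in x]
    b = [v / sum(y) for v in y]
    ab = sum(p * q for p, q in zip(a, b))
    aa = sum(p * p for p in a)
    bb = sum(q * q for q in b)
    return 1.0 - ab / (aa + bb - ab)

def pairwise(metric):
    """Distance for every pair of samples, keyed by (name1, name2)."""
    return {
        (m, n): metric(samples[m], samples[n])
        for m, n in combinations(samples, 2)
    }

jaccard = pairwise(jaccard_distance)
thetayc = pairwise(thetayc_distance)

for label, dists in (("Jaccard", jaccard), ("ThetaYC", thetayc)):
    print(label)
    for (m, n), d in dists.items():
        print(f"  {m} vs {n}: {d:.6f}")
    # The pair with the smallest distance is the one that joins first in the tree.
    print("  closest pair:", min(dists, key=dists.get))
```

The Jaccard distance between 505 and 322 is 0 because both contain exactly Otu001 and Otu003, even though the proportions are reversed. ThetaYC weights by abundance, so 505 and 528 (both almost entirely Otu001) end up about 0.0002 apart.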

Hope this helps,

Hi Pat and kmitchell,
Thanks for the input. I think my confusion was indeed caused by the difference between the two measures; my ThetaYC tree actually looks exactly like I expected.
I had revised the OTU table in an attempt to normalize the data (I have quite different sequencing depths between samples, but I guess normalize.shared could have done the trick) and to focus only on the most abundant OTUs. Building the tree with the original .shared file, however, gives pretty much the same result.
Thank you!