unifrac with overlapping groups?

Hi,

Before I even try: is it possible to run (unweighted) UniFrac using overlapping groups?

Say, I build a phylogeny of a total dataset using e.g. clearcut.

And I have groups A, B and C which are subsets of the total dataset with overlapping OTUs.

Hence the group file could look something like this:

XXXX0001  A
XXXX0002  A
XXXX0003  A
XXXX0002  B
XXXX0004  B
XXXX0001  B
XXXX0004  C
XXXX0005  C
XXXX0006  C

It feels like this will be a problem, but maybe not?

Many thanks,

Hi,

Each sequence/sample can only show up once in a design file. Why not build the tree with the actual sequence names and go from there?

Pat

Hi Pat,

Indeed. I’ve built a tree from all representative (unique) sequences. The problem remains that each tip in the tree can correspond to multiple groups.

Or am I totally off here? Am I using the wrong fasta file to build the tree in the first place?

Many thanks,

###########

My fasta file looks like this:

[seqID|otuID|#seq]

HX1JDSX01DZF9V|1|30531
A-AC-G-A-A-C-G-C--T-G-G-C-G-G--C-A-G-
HXXS1YU02HPWCF|2|15277
A-TT-G-A-A-C-G-C--T-G-G-C-G-G--C-A-T-
HXXS1YU02JCU5F|3|12880
G-GT-G-A-A-C-G-C--T-G-G-C-G-G--C-G-C-
.
.
.

Do you have a names file associated with this data? The names file records which sequences are represented by each unique sequence in your fasta file. For example, if you have a fasta file containing

>Seq_1
AAAA
>Seq_4
ATAT

Then a names file could record that Seq_1 is also representative of Seq_2 and Seq_3. Your groups file would then record that Seq_1 was in Group A, Seq_2 in Group B and Seq_3 in Group C.

The count table serves the same purpose: it summarises the names and groups data more succinctly.
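As a rough illustration of how names and groups data collapse into per-group counts, here is a minimal Python sketch. It is not mothur code. It assumes the usual two-column layouts (representative name followed by a comma-separated list of the reads it represents, and read name followed by group), and the group for Seq_4 is made up for the example. The function name `build_count_table` is hypothetical.

```python
from collections import Counter, defaultdict

# Toy data mirroring the example above. Seq_4's group is an assumption.
names_text = "Seq_1\tSeq_1,Seq_2,Seq_3\nSeq_4\tSeq_4\n"
groups_text = "Seq_1\tA\nSeq_2\tB\nSeq_3\tC\nSeq_4\tA\n"

def build_count_table(names_text, groups_text):
    """Count, per representative sequence, how many reads fall in each group."""
    # Every individual read appears exactly once and belongs to one group.
    read_to_group = dict(
        line.split() for line in groups_text.splitlines() if line.strip()
    )
    table = defaultdict(Counter)
    for line in names_text.splitlines():
        if not line.strip():
            continue
        representative, members = line.split()
        for read in members.split(","):
            table[representative][read_to_group[read]] += 1
    return {rep: dict(counts) for rep, counts in table.items()}

counts = build_count_table(names_text, groups_text)
print(counts)
```

Each read still appears only once in the groups data, yet the representative Seq_1 (a single tip in the tree) ends up associated with groups A, B and C.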

I’m not sure I understand the problem, and I don’t know how to do this in mothur, but:

If you have a tree where each representative sequence corresponds to one tip, it is (almost) always the case that the same sequences are found in different samples. Otherwise each sample would have a completely unique OTU composition. AFAIK, UniFrac explicitly quantifies the fraction of branch length that is unique (hence "uni") to one sample or the other, i.e. not shared between them, and weighted UniFrac simply weights the branches by the abundance of the OTUs at the corresponding tips.
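To make this concrete, here is a small Python sketch of unweighted UniFrac on a toy tree, using the overlapping group assignments from the original question. The tree shape and branch lengths are invented for illustration, and the helper names (`branches_to_root`, `unweighted_unifrac`) are hypothetical. It uses the plain definition (unique branch length divided by total branch length covered by either group, measured up to the root); real implementations may differ in details such as pruning the tree first.

```python
from collections import defaultdict

# Hypothetical tree: child -> (parent, branch length). "root" has no parent.
parent = {
    "XXXX0001": ("n1", 1.0), "XXXX0002": ("n1", 1.0),
    "XXXX0003": ("n2", 1.0), "XXXX0004": ("n2", 1.0),
    "XXXX0005": ("n3", 1.0), "XXXX0006": ("n3", 1.0),
    "n1": ("root", 1.0), "n2": ("root", 1.0), "n3": ("root", 1.0),
}

# Tip/group pairs from the question: the same tip may occur in several groups.
pairs = [
    ("XXXX0001", "A"), ("XXXX0002", "A"), ("XXXX0003", "A"),
    ("XXXX0002", "B"), ("XXXX0004", "B"), ("XXXX0001", "B"),
    ("XXXX0004", "C"), ("XXXX0005", "C"), ("XXXX0006", "C"),
]
groups = defaultdict(set)
for tip, group in pairs:
    groups[group].add(tip)

def branches_to_root(tips):
    """Return the branches (named by their child node) leading to any of the tips."""
    seen = set()
    for node in tips:
        # Walk upward until the root or an already visited branch is reached.
        while node in parent and node not in seen:
            seen.add(node)
            node = parent[node][0]
    return seen

def unweighted_unifrac(tips_a, tips_b):
    """Fraction of covered branch length that leads to only one of the two groups."""
    a, b = branches_to_root(tips_a), branches_to_root(tips_b)
    total = sum(parent[n][1] for n in a | b)
    unique = sum(parent[n][1] for n in a ^ b)
    return unique / total

for g1, g2 in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(g1, g2, round(unweighted_unifrac(groups[g1], groups[g2]), 3))
```

Tips shared between groups cause no difficulty here: they simply contribute branch length to the shared part, so A and B (which share two tips) come out closer than A and C.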

In R, you can compute the UniFrac distance easily with the phyloseq package (picante has it too, but only unweighted and much slower):

Example:

OTU: your OTU table
ID: your metadata table, with sample names as rownames matching the sample names in the OTU table
TREE: your phylogenetic tree, with tip labels matching the OTU names in the OTU table

library(phyloseq)

# taxa_are_rows = FALSE means samples are rows and OTUs are columns
OTUphy <- phyloseq(otu_table(OTU, taxa_are_rows = FALSE),
                   sample_data(ID),
                   phy_tree(TREE))

# weighted = TRUE gives weighted UniFrac; use FALSE for unweighted
UF <- UniFrac(OTUphy, weighted = TRUE)

Indeed, there was no real problem. Thanks for the tips!