unifrac with overlapping groups?

Johannes · February 16, 2016, 12:51pm

Hi,

Before I even try: is it possible to run (unweighted) UniFrac using overlapping groups?

Say I build a phylogeny of the total dataset using, e.g., clearcut.

And I have groups A, B and C, which are subsets of the total dataset with overlapping OTUs.

Hence the group file can look something like this:

XXXX0001  A
XXXX0002  A
XXXX0003  A
XXXX0002  B
XXXX0004  B
XXXX0001  B
XXXX0004  C
XXXX0005  C
XXXX0006  C

It feels like this will be a problem, but maybe not?

Many thanks,

pschloss · February 18, 2016, 11:28am

Hi,

Each sequence/sample can only show up once in a design file. Why not build the tree with the actual sequence names and go from there?
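A quick way to see which names break this one-entry-per-name rule is sketched below in Python. It is only an illustration: the function name `names_in_multiple_groups` is made up for this example, and the input is the nine-line group file from the first post, assumed to be whitespace-separated name/group pairs.

```python
from collections import defaultdict


def names_in_multiple_groups(lines):
    """Return {name: [groups]} for names listed under more than one group."""
    groups_by_name = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, group = fields
        if group not in groups_by_name[name]:
            groups_by_name[name].append(group)
    return {n: g for n, g in groups_by_name.items() if len(g) > 1}


# Group file from the first post (name, then group).
GROUP_FILE = """\
XXXX0001  A
XXXX0002  A
XXXX0003  A
XXXX0002  B
XXXX0004  B
XXXX0001  B
XXXX0004  C
XXXX0005  C
XXXX0006  C
"""

duplicates = names_in_multiple_groups(GROUP_FILE.splitlines())
for name, groups in duplicates.items():
    print(name, "appears in groups", ", ".join(groups))
```

For the file above this flags XXXX0001, XXXX0002 and XXXX0004, i.e. exactly the OTUs shared between groups.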

Pat

Johannes · February 18, 2016, 2:04pm

Hi Pat,

Indeed. I’ve built a tree from all representative (unique) sequences, but the problem remains that each tip in the tree can correspond to multiple groups.

Or am I totally off here? Am I using the wrong fasta file to build the tree in the first place?

Many thanks,

###########

My fasta file looks like this:

[seqID|otuID|#seq]

HX1JDSX01DZF9V|1|30531
A-AC-G-A-A-C-G-C--T-G-G-C-G-G--C-A-G-
HXXS1YU02HPWCF|2|15277
A-TT-G-A-A-C-G-C--T-G-G-C-G-G--C-A-T-
HXXS1YU02JCU5F|3|12880
G-GT-G-A-A-C-G-C--T-G-G-C-G-G--C-G-C-
.
.
.

dwaite · February 18, 2016, 7:32pm

Do you have a names file associated with this data? The names file is a record of which sequences are represented by each unique sequence in your fasta file. For example, if you have a fasta file

>Seq_1
AAAA
>Seq_4
ATAT

Then a names file could record that Seq_1 is also representative of Seq_2 and Seq_3. Your groups file would then record that Seq_1 was in Group A, Seq_2 in Group B and Seq_3 in Group C.

This is also the function of the count table, to more succinctly summarise the names/groups data.
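To make the relationship concrete, here is a minimal Python sketch that collapses a names file and a groups file into per-group counts, which is the information a count table holds. It assumes the usual tab-separated layouts (names file: representative, then a comma-separated list of all sequences it stands for; groups file: sequence name, then group). The function name `build_count_table` is hypothetical, and the group for Seq_4 is an assumption, since the example above does not give one.

```python
from collections import Counter


def build_count_table(names_lines, group_lines):
    """Return (sorted group names, {representative: [count per group]})."""
    # Map every individual sequence name to its single group.
    group_of = dict(line.split() for line in group_lines if line.strip())
    groups = sorted(set(group_of.values()))
    table = {}
    for line in names_lines:
        if not line.strip():
            continue
        representative, members = line.split()
        # Count how many of the represented sequences fall in each group.
        counts = Counter(group_of[m] for m in members.split(","))
        table[representative] = [counts[g] for g in groups]
    return groups, table


# Names file: Seq_1 also represents Seq_2 and Seq_3; Seq_4 stands alone.
NAMES = ["Seq_1\tSeq_1,Seq_2,Seq_3", "Seq_4\tSeq_4"]
# Groups file: the group for Seq_4 is assumed here for illustration.
GROUPS = ["Seq_1\tA", "Seq_2\tB", "Seq_3\tC", "Seq_4\tA"]

groups, table = build_count_table(NAMES, GROUPS)
print("\t".join(["Representative_Sequence", "total"] + groups))
for rep, counts in table.items():
    print("\t".join([rep, str(sum(counts))] + [str(c) for c in counts]))
```

Each sequence name still appears only once in the groups file, yet the tip Seq_1 ends up with a count in all three groups, which is how one tip in the tree can belong to several groups.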

fabianr · February 22, 2016, 5:03pm

I’m not sure I understand the problem, and I don’t know how to do this in mothur, but:

If you have a tree where each representative sequence corresponds to one tip, it is (almost) always the case that the same sequences are found in different samples. Otherwise each sample would have a completely unique OTU composition. AFAIK, UniFrac explicitly quantifies the fraction of branch length that is unique ("uni") to one sample rather than shared between the two, and weighted UniFrac simply weights the branches by the abundance of the OTUs at the corresponding tips.
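As a toy illustration of why shared tips are not a problem, here is a minimal Python sketch of the unweighted metric as it is commonly defined: branch length leading to tips from only one sample, divided by branch length leading to tips from either sample. The four-tip tree, its branch lengths, and the function names are all made up for this example; it is not how mothur or phyloseq implement it.

```python
# Hypothetical rooted tree: child -> (parent, branch length).
TREE = {
    "n1": ("root", 1.0),
    "n2": ("root", 1.0),
    "t1": ("n1", 1.0),
    "t2": ("n1", 1.0),
    "t3": ("n2", 1.0),
    "t4": ("n2", 1.0),
}


def tips_below(tree, node):
    """Return the set of tip names descending from node (a tip returns itself)."""
    children = [c for c, (parent, _) in tree.items() if parent == node]
    if not children:
        return {node}
    tips = set()
    for child in children:
        tips |= tips_below(tree, child)
    return tips


def unweighted_unifrac(tree, sample_a, sample_b):
    """Fraction of observed branch length that leads to only one of the two samples."""
    unique = 0.0
    observed = 0.0
    for node, (_, length) in tree.items():
        tips = tips_below(tree, node)
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a or in_b:
            observed += length
        if in_a != in_b:
            unique += length
    return unique / observed


# Two samples that share tips t1 and t2 but differ in t3 / t4.
group_a = {"t1", "t2", "t3"}
group_b = {"t1", "t2", "t4"}
print(unweighted_unifrac(TREE, group_a, group_b))
```

Here only the two terminal branches to t3 and t4 are unique, out of six observed branches of equal length, so the distance is one third. Identical samples give 0 and samples with no branch in common give 1.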

In R, you can compute the UniFrac distance easily with the phyloseq package (picante has it too, but only the unweighted version, and it is much slower):

example

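# OTU:  your OTU table (samples as rows, OTUs as columns)
# ID:   your metadata table, with sample names as row names matching the sample names in the OTU table
# TREE: your phylogenetic tree, with tip labels matching the OTU names in the OTU table

library(phyloseq)

# Combine the OTU table, sample metadata and tree into one phyloseq object
OTUphy <- phyloseq(otu_table(OTU, taxa_are_rows = FALSE),
                   sample_data(ID),
                   phy_tree(TREE))

# Weighted UniFrac distances between all pairs of samples
UF <- UniFrac(OTUphy, weighted = TRUE)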

Johannes · February 23, 2016, 11:43am

Indeed, there was no real problem. Thanks for the tips!
