Analyzing part of a dataset

I’ve been working with a dataset and have a problem I can’t quite figure out. The data comes from a MiSeq and I’ve processed it per instructions in the MiSeq and 454 example pages (thank you, thank you for those). I am now at the ‘fun’ part of my data.

The data looks like this: I am studying the lung microbiome in human subjects. One group of subjects have a disease (n = 40), one group is control (n = 20). I have two samples collected from each subject, based on location in the lung – let’s call them loc1 and loc2. I also have a set of negative, non-patient controls (N = 20). So each fastq file in the end looks like this:


And so on.

I’ve run the MiSeq and 454 scripts of all the fastq files all the way through without a problem as a single dataset. Now the subjects are of two different ancestries: European (EA) and African (AA). I now get to do a fun analysis like this one:

unifrac.weighted(tree=dataset.otu.thetayc.unique.tre,, random=T)

where I’m using my entire dataset and a design file,, to separate the subjects by race. This executes perfectly in Mothur.

Here’s the question: I now want to look JUST at loc1 by race, and then loc2 by race. Let’s generate a new design file (e.g., that lists just the loc1 files for each subject along with all the negative controls:

unifrac.weighted(tree=dataset.otu.thetayc.unique.tre,, random=T)

This fails: I get 40 error messages, each on a separate line, that look like this:

[ERROR]: Your group file does not contain SUBJ1LOC2. Please correct.

Etc etc for each of the loc2 samples. Then I get one last error message:

Name: SUBJ20LOC2 is not in your groupfile, and will be disregarded.
[ERROR]: Your count table contains more than 1 sequence named SUBJ20LOC2, sequence names must be unique. Please correct.
error with lc

Sorry for being long-winded, but I want you to see the situation. My question is simple: having processed my entire dataset (all subjects, both locations, etc) as a single unit through the MiSeq and 454 protocols, how do I now analyze just one part of the dataset? Just loc1, or just loc2 for example?

I’m concerned that if I just (in the master file) set loc2 to ‘null’, for example, that those data will be analyzed anyway.

Many thanks in advance for what I hope is a simple answer to a long question!

What do the first lines of the new design file look like? This frequently happens because there are spaces in the sample or group names - could that be a possibility?


Pat, thanks for the response.

No spaces in any cell – I’m looking at it now. The original design file has all samples, both loc1 and loc2. The new design file has ONLY the loc1 samples.

Each design file has two columns: sample name and race code (EA or AA).

The design file that works has 138 rows, one for each sample (patient loc1, patient loc2 for all 59 patients and 20 negative samples). The design file that doesn’t work omits all the loc2 rows, so it is 79 rows.

Is the fact that Mothur is looking for rows 80 to 138 a problem, since my original dataset run throw the MiSeq SOP included all 138 samples? Must I start from the top with only the loc1 and negative samples? Or is there a way to tell Mothur “for this particular unifrac analysis use this second design table that is missing 59 labels you’re otherwise expecting to see and don’t worry about it”.

Should I use my original design table and in column 2 (for race), put “NA” for all the loc2 samples? Or does Mothur treat that as a separate code for race (Mothur could be very aware and think I’m analyzing Native Americans)?

Many thanks in advance

If you have 138 samples in your data then you need 138 lines in the design file. To ignore samples, you can list it’s group as xxx. Alternatively, you can also use remove.groups/get.groups to get the specific groups you want.



Thanks for the quick response.

I took your advice to alter the design file by renaming ‘loc2’ as ‘XXX’. That indeed works; I did an amova using this modified design file and got back results that correlate very closely to a previous run of these data (that also lacked loc2). So that’s the quick solution.

For the remove.groups command, it isn’t clear to me from the Wiki just how I best remove all the samples I want to remove (e.g., all the ones I collected from loc2). There are 59 of these so typing them in by hand is out, even if I knew exactly what each name was. My current group file (‘stability.contigs.good.groups’) is 74.4 MB, and I’m not sure how to edit that to remove the samples I want out of there. A current count table (‘stability.trim.contigs.good.count_table’) is 12 MB and doesn’t have group information.

I looked at abrecovery.groups from your example: that’s a 3 KB file of sample names and grouping; I have no idea where to find that equivalent in my file set. Is that something that is nothing more than a design file with ‘.groups’ at the end of the filename?

Sorry to be dense.

There’s really not anything special about a design file - it’s the same as group file. The only difference is that the first column of a group file has the sequence names and the second has the sample name and in the first column of the design file is the sample name and the second column is a higher level of grouping of the samples. You could try to do something like using merge.groups to merge various groups into a single group.


Just to kind of expand this idea, there’s a student in my former lab who is looking at doing some pairwise UniFrac tests between samples and was asking me a similar question. They basically have some tissue samples from several sinus locations from a range of patients, and have done the standard 16S thing. Initially they built their OTU table/clearcut tree and calculated the relevant distance matrices to investigate the relative contributions of inter- and intra-patient variation to the data.

They now want to compare the communities obtained from each patient and see if the microbiota changes significantly between sites. For the OTU table this isn’t too difficult, because you can just do a get.groups() on the shared file and get only the relevant OTUs for calculation. For phylogenetic/UniFrac tests this is a bit more complicated because the structure of the tree is influenced by all the sequences used to build it, so the other patients can’t be removed from the data set as easily as they can in an OTU table.

Say we have 3 patients (A, B, C) sampled at 2 sites each (1, 2). The two analysis options will be to use a tree built from all the samples (A1, A2, B1, B2, C1, C2) and then just do a subset of the UniFrac tests (A1 vs A2, B1 vs B2 etc). In this case, there will be a large proportion of the sequences giving structure to the tree that aren’t really being included in the analysis. On the other hand, if they extract the reads obtained from each patient then build a tree (tree A, tree B, tree C) and then do the tests on each tree (treeA 1 vs 2, treeB 1 vs 2 etc) then to me this makes more biological sense since you’re testing community structure between two related sites, rather than including background noise/data from sites that aren’t relevant to the question being asked in each specific test.

Do you have any advice on which would be the better route to take? I guess the problem I have is that I’m not sure which way the UniFrac distance is calculated. It’s either

[Sum of unique branch length] / [Sum of total branch length of the groups]


[Sum of unique branch length] / [Sum of total branch length of the tree]

If it’s the first, then this issue shouldn’t matter at all. If it’s the later, then it’s quite important.

It’s the sum of the branch length of the sub tree for the two groups.