I’ve been working with a dataset and have a problem I can’t quite figure out. The data comes from a MiSeq and I’ve processed it per instructions in the MiSeq and 454 example pages (thank you, thank you for those). I am now at the ‘fun’ part of my data.
The data looks like this: I am studying the lung microbiome in human subjects. One group of subjects has a disease (n = 40) and one group is the control (n = 20). I have two samples from each subject, collected at different locations in the lung; let’s call them loc1 and loc2. I also have a set of negative, non-patient controls (n = 20). So in the end the fastq files are named like this:
subj1loc1.R1.fastq
subj1loc2.R1.fastq
...
neg1R1.fastq
And so on.
I’ve run the MiSeq and 454 scripts on all the fastq files, all the way through, as a single dataset without a problem. The subjects are of two different ancestries: European (EA) and African (AA). So I now get to do a fun analysis like this one:
unifrac.weighted(tree=dataset.otu.thetayc.unique.tre, group=race.design, random=T)
where I’m using my entire dataset and a design file, race.design, to separate the subjects by race. This executes perfectly in Mothur.
Here’s the question: I now want to look JUST at loc1 by race, and then JUST at loc2 by race. So I generated a new design file (e.g., race.loc1.design) that lists only the loc1 sample for each subject, along with all the negative controls, and ran:
unifrac.weighted(tree=dataset.otu.thetayc.unique.tre, group=race.loc1.design, random=T)
This fails: I get 40 error messages, each on a separate line, that look like this:
[ERROR]: Your group file does not contain SUBJ1LOC2. Please correct.
And so on for each of the loc2 samples. Then I get one last set of error messages:
Name: SUBJ20LOC2 is not in your groupfile, and will be disregarded.
[ERROR]: Your count table contains more than 1 sequence named SUBJ20LOC2, sequence names must be unique. Please correct.
error with lc
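To make the mismatch concrete, here is a minimal Python sketch (not mothur code, and not a claim about how mothur does the check internally) of what the first batch of messages seems to describe: names that exist in the full dataset but have no row in the reduced design file. The helper name `missing_from_design`, the sample names, the `NEG` label, and the two-column tab-separated design layout are all assumptions for illustration.

```python
def missing_from_design(all_samples, design_text):
    """Return sample names that have no row in the design file text.

    Assumes each design line is: sample name, a tab, then the group label.
    """
    listed = set()
    for line in design_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        listed.add(line.split("\t")[0])
    # Preserve the input order so the report is easy to read.
    return [name for name in all_samples if name not in listed]


# Hypothetical names patterned on the fastq files above.
all_samples = ["subj1loc1", "subj1loc2", "subj2loc1", "subj2loc2", "neg1"]
loc1_design = "subj1loc1\tEA\nsubj2loc1\tAA\nneg1\tNEG\n"

print(missing_from_design(all_samples, loc1_design))
```

In this toy case the names reported are exactly the loc2 samples, which matches the pattern of the error messages I am seeing.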
Sorry for being long-winded, but I want you to see the situation. My question is simple: having processed my entire dataset (all subjects, both locations, etc.) as a single unit through the MiSeq and 454 protocols, how do I now analyze just one part of it, for example just loc1 or just loc2?
I’m concerned that if I just set the loc2 samples to ‘null’ in the master race.design file, for example, those data will be analyzed anyway.
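For clarity, the loc1-only design file described above amounts to the following filter on the master file, which drops the loc2 rows rather than relabeling them. This is only a Python sketch under assumed conventions: `subset_design` is a hypothetical helper (not a mothur command), the design file is assumed to be tab-separated with the sample name first, and samples are assumed to be named like the fastq files above (ending in `loc1`/`loc2`, or starting with `neg` for the negative controls).

```python
def subset_design(design_text, keep_location="loc1", control_prefix="neg"):
    """Keep design rows for one lung location plus all negative controls."""
    kept = []
    for line in design_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sample = line.split("\t")[0].lower()
        # Keep the chosen location and every negative control.
        if sample.endswith(keep_location) or sample.startswith(control_prefix):
            kept.append(line)
    return "\n".join(kept) + "\n" if kept else ""


# Hypothetical master design file contents (sample, tab, ancestry).
master = (
    "subj1loc1\tEA\n"
    "subj1loc2\tEA\n"
    "subj2loc1\tAA\n"
    "subj2loc2\tAA\n"
    "neg1\tNEG\n"
)

print(subset_design(master), end="")
```

Calling it with `keep_location="loc2"` would give the matching loc2-only file.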
Many thanks in advance for what I hope is a simple answer to a long question!