Duplicate data runs with non-duplicated results

I have a dataset of lung (low-biomass) samples from a group of patients with asthma. I do two analysis runs with the same FASTQ files.

Run 1 – includes all my controls, etc., about 140 samples altogether. I run all the way through the MiSeq SOP with all the usual defaults and the Silva database. I look at taxonomy at the genus level and select a genus (e.g., Pseudomonas). Each sample in the shared and subsampled shared files has a count for Pseudomonas.
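(For context, the pipeline was essentially the stock MiSeq SOP command sequence, roughly as sketched below. The file names are placeholders rather than my actual output names, and the parameter values are the SOP defaults as best I recall them.)

mothur > make.contigs(file=asthma.files, processors=8)
mothur > screen.seqs(fasta=current, group=current, maxambig=0, maxlength=275)
mothur > unique.seqs()
mothur > count.seqs(name=current, group=current)
mothur > align.seqs(fasta=current, reference=silva.v4.fasta)
mothur > screen.seqs(fasta=current, count=current, optimize=start-end, criteria=90, maxhomop=8)
mothur > filter.seqs(fasta=current, vertical=T, trump=.)
mothur > unique.seqs(fasta=current, count=current)
mothur > pre.cluster(fasta=current, count=current, diffs=2)
mothur > chimera.uchime(fasta=current, count=current, dereplicate=t)
mothur > remove.seqs(fasta=current, accnos=current)
mothur > classify.seqs(fasta=current, count=current, reference=silva.nr.fasta, taxonomy=silva.nr.tax, cutoff=80)
mothur > remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)
mothur > make.shared(list=current, count=current, label=0.03)
mothur > classify.otu(list=current, count=current, taxonomy=current, label=0.03)
mothur > sub.sample(shared=current)

The shared and subsampled shared files come out of make.shared and sub.sample, and the OTU labels and taxonomy names I refer to below come out of classify.otu.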

Run 2 – removes the controls based on advice I’ve been given, as well as some samples that do not make a specified quality cut, so I’m down to about 95 samples. As before, I run the MiSeq SOP and use the same Silva database. Now the shared and subsampled shared genus taxonomy files, for the samples that were also present in run 1, have DIFFERENT counts for Pseudomonas. These differ by a lot.

These are the same FASTQ files, organized into two different runs, same SOP, differing only in that some FASTQ files were excluded from the start of run 2. One would think that sample #1, in run 1 and run 2, would have the same count at the genus level for the OTU that represents Pseudomonas. But it doesn’t. The same discrepancy shows up for SOME of the other entries in the taxonomy table at the phylum, order, family, and genus levels. But (for example) Lactobacillus compared very nicely across the two runs.

Thoughts? Does removing some samples cause the counts for each remaining sample to change? Might this be the result of changes to the filtering for chimeras, duplicates, etc., during run 2 because a different number of samples is present?

Update: this is curious, but it may help solve the problem.

Comparing the phylum level between run 1 and run 2: recall that run 1 is 140 samples (all my data plus reagent controls, etc.), whereas run 2 is 95 samples with the reagent controls excluded, along with some patient samples (excluded because their DNA load after PCR amplification was below a cut-off).

Now then: in run 1, per the taxonomy table, OTU 001 is Firmicutes and OTU 002 is Proteobacteria, whereas in run 2 it is the opposite: OTU 001 is Proteobacteria and OTU 002 is Firmicutes. In each run I have modestly more Proteobacteria, and why the taxonomy table reverses the labeling isn’t clear. But it DOES become important.

You see, the natural instinct in comparing the two sets is to compare the Firmicutes in each set, the Proteobacteria in each set, and so on. If I make a simple scatter plot of the Firmicutes counts from the shared file for each of the matched subjects, run 1 versus run 2, I get nonsense. Ditto for Proteobacteria.

BUT – if I plot OTU 001 for run 1 versus run 2, again matched subject for subject, I get a straight line on the line of identity, R2 = 0.999964. Ditto for OTU 002 for run 1 versus run 2 with R2 = 0.99847.

So looking at the labels assigned by the taxonomy table hoses me, but comparing OTU to OTU is perfect.

The same holds for the genus-level taxonomy tables for run 1 versus run 2: the Pseudomonas from my original post (Otu007 in run 1 and Otu004 in run 2, per the labels provided in the tables) gives me a random scatter plot. But if I plot genus Otu004 (“Prevotella” in run 1 and “Pseudomonas” in run 2), I get a straight line with R2 = 0.9999985.

So it seems that the OTUs compare correctly but the labels do not.

What did I do wrong? I’m merely an M.D., so I’d like to use phylum and genus names rather than OTU numbers.

Second update: this gets more interesting.

Both run 1 and run 2 were done with Mothur 1.33.3.

I did another run, run 3, yesterday, a duplicate of run 2 in every way except that I used Mothur 1.34 (December 2014).

This time the OTUs in the various taxonomy tables, from phylum to genus, are listed in much the same order as in run 1 and NOT in the order from run 2. As an example, at the phylum level, run 1 is this (counts are higher in run 1 simply because it has more samples):

Otu001 542855 Bacteria(100);Firmicutes(100);…
Otu002 669907 Bacteria(100);“Proteobacteria”(100);…
Otu003 225487 Bacteria(100);unclassified(100);…
Otu004 210718 Bacteria(100);“Bacteroidetes”(100);…
Otu005 51040 Bacteria(100);“Fusobacteria”(100);…
Otu006 66443 Bacteria(100);“Actinobacteria”(100);…

and run 3 (Mothur 1.34) is this – note the same order of phylum names:

Otu001 388045 Bacteria(100);Firmicutes(100);…
Otu002 413355 Bacteria(100);“Proteobacteria”(100);…
Otu003 127864 Bacteria(100);unclassified(100);…
Otu004 182709 Bacteria(100);“Bacteroidetes”(100);…
Otu005 47406 Bacteria(100);“Fusobacteria”(100);…
Otu006 54528 Bacteria(100);“Actinobacteria”(100);…

but run 2 (same dataset as run 3, but Mothur 1.33) is this:

Otu001 413163 Bacteria(100);“Proteobacteria”(100);…
Otu002 388172 Bacteria(100);Firmicutes(100);…
Otu003 182634 Bacteria(100);“Bacteroidetes”(100);…
Otu004 127655 Bacteria(100);unclassified(100);…
Otu005 54532 Bacteria(100);“Actinobacteria”(100);…
Otu006 47396 Bacteria(100);“Fusobacteria”(100);…

Note the different order of the OTU labels: OTUs 1 and 2 are swapped, 3 and 4 are swapped, and 5 and 6 are swapped relative to run 3, even though the COUNTS still line up with the taxonomy names.

The same is true if I look at genus-level taxonomy tables: run 1 (Mothur 1.33) and run 3 (Mothur 1.34) order the OTUs from 1 to 560 with the same taxonomy names, whereas run 2 (Mothur 1.33, but same dataset as run 3) orders them quite differently.

I’m inclined to use the data from run 3, since its OTU order and naming match run 1, and to disregard run 2 as an aberration – perhaps an issue with Mothur, perhaps something I did. Thoughts?

Sorry about the stream-of-consciousness posts, but I keep digging on this.

Sorry, I’m getting a little lost in your posts, but I think your concern is that the OTU numbers vary between runs, datasets, or mothur versions. The OTU numbers themselves (i.e., Otu001 vs. Otu002) do not matter as long as they have the right number of sequences and the right classification. So say you have…

Otu001 388045 Bacteria(100);Firmicutes(100);…
Otu002 413355 Bacteria(100);“Proteobacteria”(100);…

vs.

Otu001 413163 Bacteria(100);“Proteobacteria”(100);…
Otu002 388172 Bacteria(100);Firmicutes(100);…

It doesn’t matter. There will be some small differences in the number of reads between runs and versions because of randomness that is introduced at various steps. Let me know if this doesn’t clarify the problem.

Pat

Hi Pat,

I have read through your response above, and it makes sense. Could you expand a bit on the points in the pipeline that introduce randomness into the results, especially those that may change between versions?

I have always understood how differences could arise in the de novo clustering process, because a sequence may belong equally well to two different OTUs and a choice just needs to be made. I have recently been working on a project where I needed to use phylotypes in addition to the de novo OTU generation, and I have noticed the results changing slightly after I moved to the newest version. It wasn’t as readily apparent which steps of the mothur pipeline can introduce randomness in the phylotype process.

I ask because I am working with some collaborators, and I needed to change versions of mothur partway through the project because a feature I needed wasn’t working correctly. After the change, the results changed in a way that ultimately affects some borderline p-values and whether or not they meet the (admittedly arbitrary) predefined alpha. I just wanted to be adequately prepared to explain to my collaborators why there are these slight changes in the results. Thanks for your time and expertise!

Casey

classify.seqs has a random component that is used to calculate the confidence score for the classification. Also, any of the non-parametric hypothesis-testing routines (e.g., amova/homova) will have a random component. For all of these, if you are concerned about borderline cases, you can increase the number of iters.
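For example, something along these lines – the file names here are just placeholders, and the iters values are simply set well above the defaults:

mothur > classify.seqs(fasta=final.fasta, count=final.count_table, reference=your.reference.fasta, taxonomy=your.reference.tax, cutoff=80, iters=1000)
mothur > amova(phylip=final.thetayc.0.03.lt.dist, design=your.design, iters=10000)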

Sounds great, thank you for your response. I ended up doing exactly as you said and increased the iters on the metastats analysis. Thanks again!
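In case it helps anyone else who finds this thread, the call was along these lines (the file names are placeholders for my actual outputs, with iters raised well above the default):

mothur > metastats(shared=final.0.03.subsample.shared, design=treatment.design, iters=10000)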