MiSeq, MOTHUR pipeline and low biomass

Hello all

Here’s an issue that cropped up in a review of a manuscript of ours at a journal. This is my first microbiome-related submitted manuscript. Without getting into journal-specific issues, a reviewer made a couple comments about our work using MiSeq and MOTHUR that I thought I would bring here. Our manuscript looks at the lung microbiome in a cohort of control subjects and subjects with asthma, a situation in which we have a low biomass for each sample – unavoidable, it is what it is. We extracted our DNA, did V4-region PCR amplification, then MiSeq, and then used MOTHUR’s MiSeq pipeline (per the FAQ) to get us to the point of data analysis. Our friendly reviewer had three vehement issues for us, and I ask for a little advice here (sorry for the length) –

Issue #1: The reviewer notes an issue that was brought up in a paper published in November by Salter (1). Our study has, as part of the dataset, some negative reagent-only controls (DNA extraction kits, etc). As Salter notes, while these are thought to be “sterile”, in reality there are some sequences in these solutions that might confound an analysis. Indeed, for some (not all) of our reagent controls, while the amount/concentration of DNA submitted for analysis generally was lower (per nanodrop) than for our patient samples, it’s hard to tell the difference between the nseqs and sobs of these controls versus at least some of the patient samples. I’d like to hear what others are doing with low biomass situations with regard to reagent controls: does one “subtract out” these sequences from one’s samples, or are there other ways to handle this? One of my colleagues who is looking at the issues says that it’s quite “complex”, to which I agree, but that doesn’t help me deal with this reviewer.

Issue #2: The reviewer notes that in 454 pyrosequencing, ultra low DNA levels can be sequenced but at a much reduced efficiency (i.e. # of reads) and random sequencing artifacts are reduced. His point (I think) is that I wouldn’t have this issue if I had only done 454, since most of the ‘sequences’ in my reagents controls (and perhaps in some of my samples) simply wouldn’t have been picked up. I’m not willing to go to an older technology (one that our university is actually phasing out); MiSeq, etc is the current wave. So how best to respond to the objection “if only you had used 454 you wouldn’t have this problem?”

Issue #3: in part because MiSeq is more sensitive than 454, we identified 24 phyla in our patient samples. Per the phylum taxonomy table in MOTHUR, the most numerous 6 phyla (Firmicutes, Proteobacteria, Bacteroidetes, Fusobacteria, Actinobacteria and ‘unclassified’) accounted for ~99% of the total nseqs (‘size’). Proteobacteria, the most numerous, has a size of 669907 (38% of total counts), phylum #6, Fusobacteria, has a size of 51040, whereas phylum #7 in rank, Acidobacteria, has a size of 5988. Half of the phyla have counts < 1000 and the bottom 6 have a summed count of < 400. You can see the issue. In the manuscript we noted this and focused (we thought appropriately) on the top 6. Our reviewer had a major disagreement (I’m being polite) and said that the fact that we identified TWENTY-FOUR PHYLA!!! meant that our data was total crap. We had a similar issue with taxonomy at each level down to genera: we identified sequences belonging to 605 genera, but only the top 16 had a size that for each was > 1% of the total, and only five were > 5%. Question: how do others here handle rare, sparse counts of phyla, genera, OTUs, etc? This might be a reporting issue more than anything – may I draw a line (somewhere) and say that anything below the line doesn’t need even to be reported?

All thoughts greatly appreciated.


[1] Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P, Parkhill J, Loman NJ, Walker AW. BMC Biol. 2014 Nov 12;12:87.

Issue #1: The reviewer notes an issue that was brought up in a paper published in November by Salter (1). Our study has, as part of the dataset, some negative reagent-only controls (DNA extraction kits, etc). As Salter notes, while these are thought to be “sterile”, in reality there are some sequences in these solutions that might confound an analysis. Indeed, for some (not all) of our reagent controls, while the amount/concentration of DNA submitted for analysis generally was lower (per nanodrop) than for our patient samples, it’s hard to tell the difference between the nseqs and sobs of these controls versus at least some of the patient samples. I’d like to hear what others are doing with low biomass situations with regard to reagent controls: does one “subtract out” these sequences from one’s samples, or are there other ways to handle this? One of my colleagues who is looking at the issues says that it’s quite “complex”, to which I agree, but that doesn’t help me deal with this reviewer.

You’ll note that Salter doesn’t make any recommendations regarding what to actually do with the results of your negative controls :). If a sequence is a contaminant then you would expect the actual sequence (not the OTU!) to be in the same abundance relative to other contaminants in the sample and the control. So I would suggest looking at your shared file at the unique level and seeing whether there are sequences that are shared between your controls and your samples. Then ask whether their relative abundances are similar enough to suggest contamination. Ultimately, I think that if you do anything you’re doing far more than anyone else. Incidentally, the fact that people rarely do anything explains why the Salter paper was such a big deal. You might say something like…

“We appreciate the reviewer’s concern regarding the presence of contaminants in our samples due to the relatively low biomass in the original samples. This is clearly an important area, but robust methods for culling putative contaminants have been poorly explored. The risk of being overly aggressive in culling sequences is that many organisms that are found in the negative controls could possibly be found in our environment (e.g. Pseudomonas). Therefore, we decided to do … We appreciate that this may not be the best method; however, considering the lack of other methods in the literature we feel that this was a conservative approach that protects us against ascribing biological significance to possible contaminants.”
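
For anyone who wants to try the control-versus-sample comparison described above, here is a rough Python sketch. It is not a mothur command and not an established method. It assumes the shared file is tab-delimited with the columns label, Group and numOtus followed by one column per unique sequence, and it assumes you can name your reagent-control groups. The function names, the pooling of controls, and the 2-fold cutoff (`max_fold`) are all arbitrary choices made for illustration.

```python
import csv
import io


def read_shared(handle):
    """Parse a mothur-style shared table into {group: {sequence_id: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    ids = header[3:]  # columns after label, Group, numOtus
    table = {}
    for row in reader:
        if row:
            table[row[1]] = dict(zip(ids, (int(x) for x in row[3:])))
    return table


def relative_abundance(counts):
    """Convert raw counts to fractions of their own total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def contaminant_candidates(table, control_groups, max_fold=2.0):
    """For each sample, list sequences shared with the pooled controls whose
    abundance relative to the other shared sequences is within max_fold
    (an arbitrary, hypothetical cutoff) of the value seen in the controls."""
    pooled = {}
    for group in control_groups:
        for seq, n in table[group].items():
            pooled[seq] = pooled.get(seq, 0) + n

    result = {}
    for group, counts in table.items():
        if group in control_groups:
            continue
        shared = [s for s, n in counts.items() if n > 0 and pooled.get(s, 0) > 0]
        flagged = []
        if len(shared) >= 2:  # a single shared sequence gives no ratio to compare
            sample_rel = relative_abundance({s: counts[s] for s in shared})
            control_rel = relative_abundance({s: pooled[s] for s in shared})
            for s in shared:
                fold = max(sample_rel[s] / control_rel[s],
                           control_rel[s] / sample_rel[s])
                if fold <= max_fold:
                    flagged.append(s)
        result[group] = sorted(flagged)
    return result


# Made-up toy data, only to show the shape of the input.
rows = [
    ["label", "Group", "numOtus", "Seq1", "Seq2", "Seq3", "Seq4"],
    ["unique", "blank1", "4", "50", "30", "0", "20"],
    ["unique", "patientA", "4", "100", "60", "800", "45"],
    ["unique", "patientB", "4", "2", "0", "900", "300"],
]
demo = "\n".join("\t".join(r) for r in rows)
table = read_shared(io.StringIO(demo))
print(contaminant_candidates(table, control_groups=["blank1"]))
```

In this toy table, patientA carries Seq1, Seq2 and Seq4 in nearly the same proportions as the blank, so all three are flagged. patientB shares Seq1 and Seq4 with the blank but in very different proportions, so nothing is flagged. A flag only means "consistent with contamination". Whether to remove, report, or keep such sequences is still a judgment call.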


Issue #2: The reviewer notes that in 454 pyrosequencing, ultra low DNA levels can be sequenced but at a much reduced efficiency (i.e. # of reads) and random sequencing artifacts are reduced. His point (I think) is that I wouldn’t have this issue if I had only done 454, since most of the ‘sequences’ in my reagents controls (and perhaps in some of my samples) simply wouldn’t have been picked up. I’m not willing to go to an older technology (one that our university is actually phasing out); MiSeq, etc is the current wave. So how best to respond to the objection “if only you had used 454 you wouldn’t have this problem?”

That’s bullshit. If you had done Sanger then you wouldn’t have the problem either. Or culturing, or microscopy. I think you point them to the earlier rebuttal.


Issue #3: in part because MiSeq is more sensitive than 454, we identified 24 phyla in our patient samples. Per the phylum taxonomy table in MOTHUR, the most numerous 6 phyla (Firmicutes, Proteobacteria, Bacteroidetes, Fusobacteria, Actinobacteria and ‘unclassified’) accounted for ~99% of the total nseqs (‘size’). Proteobacteria, the most numerous, has a size of 669907 (38% of total counts), phylum #6, Fusobacteria, has a size of 51040, whereas phylum #7 in rank, Acidobacteria, has a size of 5988. Half of the phyla have counts < 1000 and the bottom 6 have a summed count of < 400. You can see the issue. In the manuscript we noted this and focused (we thought appropriately) on the top 6. Our reviewer had a major disagreement (I’m being polite) and said that the fact that we identified TWENTY-FOUR PHYLA!!! meant that our data was total crap. We had a similar issue with taxonomy at each level down to genera: we identified sequences belonging to 605 genera, but only the top 16 had a size that for each was > 1% of the total, and only five were > 5%. Question: how do others here handle rare, sparse counts of phyla, genera, OTUs, etc? This might be a reporting issue more than anything – may I draw a line (somewhere) and say that anything below the line doesn’t need even to be reported?

Again, I think you point them back to the first point and indicate that you have used the most stringent methods for curating your data. You can also agree with them that you were surprised to find so many phyla after removing the contaminants and point to that as being an exciting aspect of the study. But I would really hammer the point that you’re being stringent in your data quality and how you deal with contaminants.

Good luck!
Pat

Pat, many thanks!

Can I quote you on the “bullshit” part? :)

The issue of contamination is indeed vexing, and colleagues here are pretty much shrugging their shoulders – there’s no good method for handling this. I’ll look at the shared file as you suggest. In particular, if the relative abundance is similar in all the reagent controls (done on different days, as we did each batch of DNA extractions), that might indeed suggest that certain OTUs are contaminants. I think what I’ll do then is point that out in the revised manuscript and allow the reader to weigh the importance.

For the high number of phyla/genera, I think I’ll take your advice as well and also point to the sparsity of sequences more explicitly – yes there were 24 phyla but the bottom xx have nseqs that are < 0.1%, etc.
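
One way to make that sparsity explicit is to tabulate each taxon's share of the total reads and count how many fall below a chosen reporting line. The Python sketch below assumes you already have per-taxon counts (for example, the size column from a taxonomy summary). The function name, the 0.1% default, and the toy counts are hypothetical and are not the numbers from the manuscript.

```python
def summarize_rare_taxa(counts, threshold=0.001):
    """Split taxa into those at or above `threshold` (fraction of total reads)
    and those below it, so the rare tail can be reported in one sentence."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    above = [(name, n) for name, n in ranked if n / total >= threshold]
    below = [(name, n) for name, n in ranked if n / total < threshold]
    return {
        "total_reads": total,
        "above": above,
        "n_below": len(below),
        "below_reads": sum(n for _, n in below),
    }


# Made-up counts for illustration only.
toy_counts = {"PhylumA": 9000, "PhylumB": 900, "PhylumC": 95,
              "PhylumD": 4, "PhylumE": 1}
summary = summarize_rare_taxa(toy_counts, threshold=0.001)

for name, n in summary["above"]:
    print(f"{name}: {n} reads ({100 * n / summary['total_reads']:.2f}%)")
print(f"{summary['n_below']} taxa fall below the line, totalling "
      f"{summary['below_reads']} reads")
```

Reporting the full table in a supplement and stating the cutoff in the text lets the reader see where the line was drawn rather than having the rare taxa silently dropped.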

Again, much appreciated.

Good luck - and always remember that my snark regularly gets me in trouble! Your mileage may vary in quoting me :)