How good is good enough (analysis of mock community)?

Hi,
In comparing analyses of sequence data prepared from a mock community, how closely should the final output (i.e., post-curation) reflect the true composition of the mock community? What are acceptable tolerances for experimental error in data generation and analysis?
Is it possible to apply a statistical framework to rank how well the experimental community composition matches the actual one?

My mock is equimolar and has 20 species.

Kind regards,
Brindha.

Hello Brindha,
Most of the time Pat and Sarah answer, but I want to contribute to the community as well, so here are my two cents.

That is a very good question. I have been dealing with those same issues and doing a ton of reading. One paper I can highly recommend is

“Ironing out the wrinkles in the rare biosphere through improved OTU clustering” by Huse et al.

Now, if you adopt an OTU-based analysis, the resulting number of OTUs is a function of (as far as I have understood thus far):

  1. Your sequencing error rate
  2. The cutoff used for clustering
  3. The number of 16S operons in each of your organisms (some bacteria have more than ONE 16S operon. Did you know that?)
  4. The variability of the 16S sequences
  5. The precise algorithm used (complete, average, or single linkage; see the toy sketch at the end of this post)

Ultimately, it is an interplay of ALL of these factors that determines the number of OTUs you obtain, so I do not think there is a straightforward answer to HOW MANY OTUs you should get. If you use a phylogeny-based approach, you WILL get better results for a mock community, but there is no guarantee you will get the same results on your actual samples (it is database dependent). Ultimately, what I have realized is that OTUs are just a “computational” way of aggregating sequences and comparing them between samples.

i.e., OTUs != bacterial strains.
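
To make point 5 concrete, here is a toy sketch in Python (the distances are made up, and this is not any particular tool’s pipeline) showing how the linkage method alone changes the OTU count at the same 0.03 cutoff:

```python
# A toy sketch with hypothetical pairwise distances among 5 reads; it is
# not any specific tool's pipeline, just an illustration of linkage effects.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise sequence distances (symmetric, zero diagonal).
d = np.array([[0.00, 0.02, 0.04, 0.10, 0.12],
              [0.02, 0.00, 0.03, 0.11, 0.13],
              [0.04, 0.03, 0.00, 0.09, 0.11],
              [0.10, 0.11, 0.09, 0.00, 0.02],
              [0.12, 0.13, 0.11, 0.02, 0.00]])

for method in ("single", "average", "complete"):
    Z = linkage(squareform(d), method=method)
    otus = fcluster(Z, t=0.03, criterion="distance")
    print(f"{method:>8} linkage -> {len(set(otus))} OTUs at a 0.03 cutoff")
```

With these distances, single linkage pulls reads 0, 1, and 2 into one OTU (2 OTUs total), while average and complete linkage keep read 2 separate (3 OTUs total). Same data, same cutoff, different answer.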

Thanks for the question and for contributing, quantrix.

My philosophy is that how closely the output resembles the expectation is not what we’re after. The number of OTUs out the back end is affected by the number of reads that went into the analysis: more reads, more garbage, more OTUs. We focus on the error rates instead. Generally, we see error rates below 0.0002 (0.02%). I would worry if you are at 0.1% or above.
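
In case it helps, here is a minimal sketch of that error rate calculation, assuming you already have each mock read aligned to its closest reference sequence (mothur’s seq.error command does the full version of this):

```python
# A minimal sketch of a per-base error rate, assuming each mock read has
# already been aligned to its closest reference ('-' marks alignment gaps).
def error_rate(aligned_pairs):
    errors = bases = 0
    for read, ref in aligned_pairs:
        for r, t in zip(read, ref):
            if r == '-' and t == '-':
                continue            # skip columns that are gaps in both
            bases += 1
            if r != t:              # substitutions and indels both count
                errors += 1
    return errors / bases

# Toy example: one error in 16 aligned bases -> 0.0625 (6.25%), far above
# the ~0.02% you would hope to see after curation.
pairs = [("ACGT-CGT", "ACGTACGT"), ("ACGTACGT", "ACGTACGT")]
print(f"error rate = {error_rate(pairs):.4f}")
```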

Also, good for you for getting a mock community sequenced! Be sure to report your error rate in your paper’s methods section.

Pat

Hi Pat,

I’m currently analysing some mock community 16S sequencing data and have a couple of questions.

The way I see it, there are two things you need to take into account to check that your methods are not altering the true biological signal in your sample (in this case, the mock community):

  1. Presence/absence: check whether, after your experiments (and analysis), you recover the same number of species as are in the mock community.
  2. Abundance: check whether PCR amplification and other steps are altering the true abundances in your mock community.

What do you mean when you say to check the error rates? The error rates of both the presence/absence and the abundance profiles?

I’ve checked, and in my data there are 19 OTUs, accounting for all species in the mock community except one (I used the BEI 782 Mock Community B, even concentration).
However, I’m not quite sure how to analyse the abundance part of the data; if you could point me in the right direction, I would appreciate it.

Thanks,

Andrew

  1. Presence/absence: check whether, after your experiments (and analysis), you recover the same number of species as are in the mock community.

Actually… I don’t really care about this. If you sequence 1,000 reads, you could get dead on the 20 OTUs (or whatever) you might expect. If you did 10,000 reads, you might get 50 OTUs. The difference is the sampling depth; the thing in common is the error rate. If we reduce the error rate, we know we’ll still get extra OTUs, but the rate of extra OTUs is likely to be the same across your samples.
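
To illustrate the sampling-depth effect, here is a rough sketch with simulated reads (the labels and counts are made up; only the trend matters):

```python
import random

# Simulated reads: a 20-member mock (500 reads each) plus a tail of rare,
# error-generated "spurious" OTUs at one read apiece. All values made up.
labels = [f"OTU{i:02d}" for i in range(20) for _ in range(500)]
labels += [f"spurious{i}" for i in range(30)]

def mean_otus(labels, depth, iters=100, seed=42):
    """Average number of unique OTUs seen across random subsamples."""
    rng = random.Random(seed)
    return sum(len(set(rng.sample(labels, depth)))
               for _ in range(iters)) / iters

for depth in (1000, 5000, 10000):
    print(f"{depth:>6} reads -> ~{mean_otus(labels, depth):.0f} OTUs")
# Deeper sampling picks up more of the rare error tail, so the OTU count
# climbs with depth even though only 20 organisms are in the tube.
```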

Now… if your PCR primers are biased (and they all are), then some things are likely not to amplify. We see this with the V4 primers: they don’t really amplify P. acnes. The typical 8F primer doesn’t amplify Bifidobacteria. And so on. So you have to pick your primers to match the expected biodiversity of your community.

  2. Abundance: check whether PCR amplification and other steps are altering the true abundances in your mock community.

This is a PCR issue, not a sequencing issue, and the field has long suspected that there are biases in PCR with multi-template DNA pools. To quantify this, you would really need to use qPCR to quantify the individual genomes in your template, then run the samples through your pipeline and see whether what comes out matches the input. Because of errors in DNA quantification, pipetting, etc., you can’t take it on faith that what is billed as an “even mixture” really is even. In the end, however, what is one to do if the input and output don’t match? We get over this by acknowledging that there are biases and asserting that the biases botch all of the samples equally.
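
If you still want to put a number on how far the output drifts from an even mixture, one simple option is to compare the observed relative abundances to the flat 1/20 profile. A sketch with simulated counts (swap in your own OTU counts):

```python
import math
import random

# A sketch comparing observed relative abundances against the flat 1/20
# profile expected from an equimolar 20-member mock. Counts are simulated;
# replace them with your own OTU counts.
rng = random.Random(7)
observed = {f"sp{i:02d}": rng.randint(300, 900) for i in range(1, 21)}
total = sum(observed.values())

for sp, n in sorted(observed.items()):
    frac = n / total
    fold = math.log2(frac / (1 / 20))  # 0 = perfectly even; +/-1 = 2x off
    flag = "  <-- check" if abs(fold) > 1 else ""
    print(f"{sp}: {frac:.3f} (log2 fold vs even = {fold:+.2f}){flag}")
```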

Hope this helps.
Pat

Hi Pat,

Thanks for your helpful input.

I’ll calculate the error rates using multiple subsamplings at different sampling depths and see what I get.

Regarding the abundance part, I also suspect this bias is due to PCR amplification and I’ll keep this in mind when analyzing the data.

In case others are interested, I found this article that might be useful:

Thanks,

Andrew