negative control

Hello: I wasn’t sure where to post this question!
I have a single biopsy specimen that was run through the MiSeq pipeline, along with a mock community and a negative control. My question is, how can I really identify what is a contaminant, and how should I deal with the negative control data? I came across a post of yours that basically said we have no idea what to do with negative control data, but it is a bit dated. Any advice on how to tackle this?


The simplest thing to do would be to toss any sequence from your biopsy sample that shows up in your negative control.


You get bleed over across samples, right? I’ve seen this in my data and read it in a paper (somewhere). If an abundant sequence in your real sample bleeds over to the negative, then you’ll be in real trouble if you cull that sequence.

I suspect a reduction in sequence count, rather than a cull, is best.

There really are no good options currently. I think the key is to demonstrate that you conscientiously considered the issue and dealt with contaminants accordingly.


Reviving this. How would you toss all sequences that are present in neg controls? (I usually have an extraction neg and a PCR neg.) Would you toss them after the first unique.seqs or after pre.cluster?

I’d chuck them after precluster. You would have to do this outside of mothur.


Could you add a dereplicate (a la chimera checking) option to remove.seqs?

Otherwise, I’ll try to figure out how to do it. Maybe grep? It’d be a matter of removing whole lines from both the fasta and the count_table, right?

You could probably do it by using get.groups to get a count table/fasta file of sequences found in your negative control, and then turn one of those into an accnos file for remove.seqs.

Something like:

get.groups(fasta=blah.fasta, count=blah.count_table, groups=Negative1-Negative2)

grep ">" blah.pick.fasta | sed "s/>//g" > badseqs.accnos

remove.seqs(fasta=blah.fasta, count=blah.count_table, accnos=badseqs.accnos)

Or if there aren’t too many sequences, you could just open the count table in Excel and copy out the sequence names.

I think that would just remove the neg control seqs rather than the neg controls and all the seqs that bin with any neg seqs. Right?

Hmm, you might want to test it, but the first step should identify all the sequences that are observed in the negative control. The second removal would remove them from the entire data set, so would remove them from the negative controls, and anywhere else they occur (leaving your negative control entries empty).

That’s a point though, you’d probably want to do this on the OTUs, not the unclustered sequences.

Crap, have to test this out today. I had a PCR with no band return 5000 seqs (it had a bright primer dimer band). Unfortunately, we’re out of DNA for those samples. So bioinformatic removal it is. I’ll report back.

Instead of:

grep ">" blah.pick.fasta | sed "s/>//g" > badseqs.accnos

You could use mothur’s list.seqs command:

mothur > list.seqs(fasta=blah.pick.fasta) - outputs accnos file containing names of sequences in fasta file

Sarah, will that remove all the sequences that are binned with the negatives?

Hi, I’d like to revive this thread. I am working on bacterial communities in human biopsy samples, where poor amplification of bacterial sequences due to competing human-host sequences is expected.

Disclaimer: I’m not as bioinformatics savvy as I would like to be. I have nevertheless been following the MiSeq SOP without major problems. I’ve even been immensely enjoying the adventure.

I am interested in the mothur community’s feedback on the pipeline proposed by Jervis-Bardy et al., Microbiome (2015) 3:19. They suggest that a negative correlation between amplicon DNA concentration and relative abundance for a given OTU is a good sign of a contaminant and that such OTUs should be removed.

Despite honest searching, I have not found a thread related to this subject. How could this be done in mothur? Is their reasoning sound in your opinion?

Please indicate the correct thread to go to if I have somehow missed it. -C

They created a complicated computational “cleaning” procedure rather than using the negative controls? I’m not a fan of that. That is selective removal of some of your data based on preconceived notions of what it should look like, and I’m not a fan of any of those procedures.

But I haven’t worked out how to remove all seqs that are preclustered with seqs from the neg controls yet (I realized I had more DNA for the samples I was talking about earlier in this thread)

You can use sed to remove the lines where a neg control sequence is found

sed '/SEQUENCENAME/d' in.count_table > out.count_table

There is a way to feed sed a file of patterns to look for, I just can’t remember what it is. I’ll try to remember and update this.
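
One way to work from a file of names is sketched below. It swaps sed for awk, because awk can compare the name column exactly instead of pattern-matching anywhere on the line. It assumes a plain tab-delimited count table with the sequence name in the first column and an accnos file with one name per line. The file names and demo contents are placeholders, not real data.

```shell
# Demo inputs (hypothetical names and counts, for illustration only).
printf 'Representative_Sequence\ttotal\tS1\tNeg1\n' > in.count_table
printf 'seqA\t10\t10\t0\nseqB\t7\t5\t2\nseqC\t3\t3\t0\n' >> in.count_table
printf 'seqB\n' > badseqs.accnos

# While reading the first file, remember every name to drop.
# While reading the second file, print only rows whose first column
# is not in that list (the header row is kept because its first
# column is not a sequence name in the list).
awk -F'\t' '
  FILENAME == ARGV[1] { bad[$1]; next }
  !($1 in bad)
' badseqs.accnos in.count_table > out.count_table
```

The fasta file would need the same names removed so the two files stay in sync, which is what remove.seqs already does when given both files.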

Wouldn’t the following sequence of commands do the trick?

get.groups(fasta=blah.precluster.fasta, count=blah.precluster.count_table, groups=negative_control)
remove.seqs(accnos=blah.precluster.pick.accnos, fasta=blah.precluster.fasta)

If it doesn’t because associated bins are not removed,… well then… how can we do the equivalent of a “list.bins” command and a “remove.bins” command?

Well pschloss suggested removing negative control sequences after preclustering so here goes. I’m doing this:

pre.cluster(fasta=blah.fasta, count=blah.count_table, diffs=2)
summary.seqs(fasta=blah.precluster.fasta, count=blah.precluster.count_table) #look at what you got
count.groups(count=blah.precluster.count_table) #look at what you got in each group

get.groups(fasta=blah.precluster.fasta, count=blah.precluster.count_table, groups=BN1-BN2) #single out the negative control groups here; my "BN"s
summary.seqs(fasta=blah.precluster.pick.fasta, count=blah.precluster.pick.count_table) #look at what is in the negative control just in case
system(rename blah.precluster.pick.fasta neg_control.fasta) #rename neg control fasta file something nicer
system(rename blah.precluster.pick.count_table neg_control.count_table) #rename neg control count file something nicer

list.seqs(count=neg_control.count_table) #generate accnos file for neg control
remove.seqs(accnos=neg_control.accnos, fasta=blah.precluster.fasta, count=blah.precluster.count_table) #remove sequences (just like for chimeras)
summary.seqs(fasta=blah.precluster.pick.fasta, count=blah.precluster.pick.count_table) #make sure fewer sequences
count.groups(count=blah.precluster.pick.count_table) #make sure negative control groups disappear

This appears to work for me. Am I horribly wrong? I would like for another person to try this and give feedback please.
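
One way to double-check the result outside mothur is to confirm that none of the negative-control names survive in the cleaned fasta. This is only a sketch: the file names and contents below are made-up stand-ins for the cleaned fasta and the negative-control accnos file, and it assumes one name per header line, optionally followed by a tab and extra text.

```shell
# Demo inputs (hypothetical sequence names, for illustration only).
printf '>seqA\nACGT\n>seqC\nACGA\n' > demo.pick.fasta
printf 'seqB\nseqD\n' > demo_neg.accnos

# Names still present in the cleaned fasta: take header lines,
# strip the leading ">", and keep only the text before the first tab.
grep '^>' demo.pick.fasta | sed 's/^>//' | cut -f1 | sort > kept.names
sort demo_neg.accnos > neg.names

# comm -12 prints only the lines common to both sorted files.
# An empty result means no negative-control sequence was left behind.
comm -12 kept.names neg.names > leftover.names
echo "negative-control names still present: $(grep -c '' leftover.names)"
```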

I’m thinking analyses should be performed with and without negative control removal, to see the potential effect it has, discuss the taxa that are “removed”, etc. Any thoughts?

What would be the advantage of removing negative control sequences before doing chimera searching and removal? Would it be better to remove chimeras first, and then remove negative control sequences?

Thanks for the suggestion. I’ll try it, hopefully next week.

I’m also very interested in this. I gave your protocol a try with my data, trying several different options:

  • Removing all sequences present in any negative control / all negative controls / something in between
  • Removing all sequences whose relative abundance wasn’t at least twice as high in samples as in neg controls

I tried both the precluster stage and after chimera removal but didn’t yet directly compare these (should I expect any difference other than perhaps processing time?).

At a quick glance this seems at least much better than screening at the OTU level which I tried previously (and which was very obviously far too aggressive). And it seems (again at very first glance) it’s fairly robust in the sense that the different options above probably generated fairly equal results. But I need to look a bit more carefully to confirm this.

With best regards, Mikael

I have now had a closer look at the contamination screening trial and I’m still pretty happy with it.

Removing sequences present in the negative controls at the precluster or chimera phase deletes roughly 50% of data in my very-low-biomass samples, including obvious reagent contaminants (such as thermophilic bacteria which should not be there) and of course probably also some real sample microbes. Most of my high-biomass samples are not much affected.

Obviously there will be false negatives due to low-level cross-contamination or the same species really existing in both the samples and the controls (based on 16S qPCR, I expected something like 25% to be contamination).

But can you think of any reason why this could still allow false positives?

I tried both precluster stage and after removing chimeras and very rare sequences. In my case, it seems better to do this AFTER chimera removal. If I do it before chimera removal, I end up with a number of OTUs which can’t be classified at the phylum level, so I guess more chimeras remain undetected.

I compared deleting everything present in any negative control (maybe a bit too strict) and deleting everything EXCEPT sequences that are at least twice as abundant in the samples as in the controls (in relative terms). These produced pretty similar results. I also tried deleting only those sequences which are present in ALL negative controls, but this is too relaxed (still left almost 50% of negative control data intact).
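
A minimal sketch of the twofold rule, written against a plain tab-delimited count table (name, total, then one column per group), might look like this. How the relative abundances were computed is not spelled out above, so this version makes an assumption: all sample columns are pooled into one library and all negative-control columns into another. The group name BN1 and the demo counts are hypothetical. It writes an accnos file that could then be passed to remove.seqs.

```shell
# Demo count table (hypothetical names and counts, for illustration only).
printf 'Representative_Sequence\ttotal\tS1\tS2\tBN1\n' > demo.count_table
printf 'seqA\t90\t50\t40\t0\n' >> demo.count_table
printf 'seqB\t19\t5\t5\t9\n'   >> demo.count_table
printf 'seqC\t51\t25\t25\t1\n' >> demo.count_table

# negs is a dash-separated list of negative-control group names.
# The table is read twice: pass 1 sums the pooled library sizes,
# pass 2 flags sequences seen in the negatives whose pooled sample
# relative abundance is less than twice their negative-control one.
awk -F'\t' -v negs='BN1' '
  BEGIN { n = split(negs, a, "-"); for (i = 1; i <= n; i++) isneg[a[i]] = 1 }
  FNR == 1 {
    for (c = 3; c <= NF; c++) col_is_neg[c] = ($c in isneg)
    next
  }
  NR == FNR {
    for (c = 3; c <= NF; c++) {
      if (col_is_neg[c]) { negtot += $c } else { samptot += $c }
    }
    next
  }
  {
    s = 0; g = 0
    for (c = 3; c <= NF; c++) {
      if (col_is_neg[c]) { g += $c } else { s += $c }
    }
    if (g > 0 && s / samptot < 2 * g / negtot) print $1
  }
' demo.count_table demo.count_table > badseqs.accnos
```

In the demo, seqB is flagged because it dominates the negative control, seqC is kept because it is far more abundant in the samples, and seqA is never considered because it does not occur in the negative at all.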