Hello: I wasn’t sure where to post this question!
I have a single biopsy specimen that was run through the MiSeq pipeline, along with a mock community and a negative control. My question is: how can I really identify what is a contaminant? And how do I appropriately deal with the negative control data? I came across this post of yours http://blog.mothur.org/2014/11/12/TheKitome/ that basically said we have no idea what to do with negative control data, but it is a bit dated… So, any advice on how to tackle this?
You get bleed-over across samples, right? I’ve seen this in my own data and read about it in papers (somewhere). If an abundant sequence in your real sample bleeds over into the negative, then you’ll be in real trouble if you cull that sequence.
I suspect reducing the sequence counts, rather than culling outright, is best.
There really are no good options currently. I think the key is to demonstrate that you conscientiously considered the issue and dealt with contaminants accordingly.
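To make the count-reduction idea above concrete, here is a minimal Python sketch of what per-sequence subtraction could look like (sequence names and counts are invented; as far as I know mothur has no built-in command for this, so you’d apply it to an exported count table yourself):

```python
# Sketch of count reduction: subtract each sequence's negative-control count
# from its count in a real sample, flooring at zero, instead of deleting the
# sequence entirely. All names and numbers below are hypothetical.

def subtract_negative(sample_counts, neg_counts):
    """Return sample counts with negative-control counts subtracted (min 0)."""
    return {
        seq: max(count - neg_counts.get(seq, 0), 0)
        for seq, count in sample_counts.items()
    }

sample = {"Seq_A": 1200, "Seq_B": 15, "Seq_C": 40}
neg = {"Seq_B": 20, "Seq_C": 5}

cleaned = subtract_negative(sample, neg)
# Seq_A is untouched, Seq_B is floored to 0, Seq_C is reduced to 35
```

This keeps an abundant real sequence that merely bled into the negative control, while zeroing out sequences whose sample counts are at or below their contamination level.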
Reviving this. How would you toss all sequences that are present in the negative controls (I usually have an extraction negative and a PCR negative)? Would you toss them after the first unique.seqs or after pre.cluster?
You could probably do it by using get.groups to get a count table/fasta file of sequences found in your negative control, and then turn one of those into an accnos file for remove.seqs.
Hmm, you might want to test it, but the first step should identify all the sequences that are observed in the negative control. The second step would then remove them from the entire data set, i.e., from the negative controls and from anywhere else they occur (leaving your negative control entries empty).
That’s a point though, you’d probably want to do this on the OTUs, not the unclustered sequences.
Crap, I have to test this out today. I had a PCR with no visible band return 5000 seqs (it had a bright primer-dimer band). Unfortunately, we’re out of DNA for those samples, so bioinformatic removal it is. I’ll report back.
Hi, I’d like to revive this thread. I am working on bacterial communities in human biopsy samples, where poor amplification of bacterial sequences due to competing human host sequences is expected.
Disclaimer: I’m not as bioinformatics-savvy as I would like to be. I have nevertheless been following the MiSeq SOP without major problems, and I’ve even been immensely enjoying the adventure.
I am interested in the mothur community’s feedback on the pipeline proposed by Jervis-Bardy et al., Microbiome (2015) 3:19. They suggest that a negative correlation between amplicon DNA concentration and relative abundance for a given OTU is a good sign of a contaminant, and that such OTUs should be removed.
Despite honest searching, I have not found a thread related to this subject. How could this be done in mothur? Is their reasoning sound in your opinion?
Please indicate the correct thread to go to if I have somehow missed it. -C
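In case it helps frame the question, here is how I understand their negative-correlation idea, sketched outside mothur in Python (all numbers are invented and this is just my reading of the approach, not their actual code; a real analysis would also want a significance test, not just the sign of the correlation):

```python
# Sketch of the idea from Jervis-Bardy et al.: flag OTUs whose relative
# abundance correlates NEGATIVELY with amplicon DNA concentration across
# samples, on the logic that a fixed reagent-contaminant input makes up a
# larger fraction of low-biomass samples. Data below are made up.

def rank(values):
    """1-based ranks (ties broken by input order), for a Spearman-style test."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(rank(x), rank(y))

dna_conc = [0.5, 1.0, 5.0, 20.0, 50.0]          # ng/uL per sample (invented)
otu_relabund = {
    "Otu001": [0.40, 0.30, 0.10, 0.02, 0.01],   # falls as DNA rises -> suspect
    "Otu002": [0.05, 0.08, 0.20, 0.30, 0.35],   # rises with DNA -> looks real
}

suspects = [otu for otu, ra in otu_relabund.items()
            if spearman(dna_conc, ra) < 0]
```

With these toy numbers, only Otu001 would be flagged. mothur itself has no command for this as far as I can tell, which is partly why I’m asking.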
They created a complicated computational “cleaning” procedure rather than using the negative controls? I’m not a fan of that. It amounts to selective removal of some of your data based on preconceived notions of what it should look like, and I’m not a fan of any of those procedures.
But I haven’t yet worked out how to remove all seqs that get pre-clustered with seqs from the neg controls (I realized I had more DNA for the samples I was talking about earlier in this thread).
You can use sed to remove the lines where a neg control sequence is found
sed '/SEQUENCENAME/d' in.count_table > out.count_table
There is a way to feed sed a file of patterns to look for, I just can’t remember what it is. I’ll try to remember and update this.
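Following up on the pattern-file question: one option (untested on a real count_table, and the file names here are just placeholders) is grep’s `-f` flag, which reads one pattern per line from a file and, combined with `-v`, deletes every matching line:

```shell
# Build a toy count_table and a list of negative-control sequence names,
# then strip every line naming one of those sequences.
printf 'Seq_A\t10\nSeq_B\t5\nSeq_C\t7\n' > in.count_table
printf 'Seq_B\nSeq_C\n' > neg_names.txt

# -F treats the names as fixed strings (no regex surprises) and -w matches
# whole words only, so Seq_B cannot accidentally match a Seq_B2
grep -v -F -w -f neg_names.txt in.count_table > out.count_table

cat out.count_table   # only the Seq_A line remains
```

The same file could also drive sed via `sed -f` if each line were written as a `/NAME/d` command, but the grep form needs no rewriting of the name list.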
Well, pschloss suggested removing negative control sequences after pre-clustering, so here goes. I’m doing this:
pre.cluster(fasta=blah.fasta, count=blah.count_table, diffs=2)
summary.seqs(fasta=blah.precluster.fasta, count=blah.precluster.count_table) #look at what you got
count.groups(count=blah.precluster.count_table) #look at what you got in each group
get.groups(fasta=blah.precluster.fasta, count=blah.precluster.count_table, groups=BN1-BN2) #single out the negative control groups here; my "BN"s
summary.seqs(fasta=blah.precluster.pick.fasta, count=blah.precluster.pick.count_table) #look at what's in the negative control, just in case
system(mv blah.precluster.pick.fasta neg_control.fasta) #rename the neg control fasta file to something nicer (mv works on Linux/Mac; the rename command does not take plain old/new names on most systems)
system(mv blah.precluster.pick.count_table neg_control.count_table) #rename the neg control count file to something nicer
list.seqs(count=neg_control.count_table) #generate accnos file for neg control
remove.seqs(accnos=neg_control.accnos, fasta=blah.precluster.fasta, count=blah.precluster.count_table) #remove sequences (just like for chimeras)
summary.seqs(fasta=blah.precluster.pick.fasta, count=blah.precluster.pick.count_table) #make sure fewer sequences
count.groups(count=blah.precluster.pick.count_table) #make sure negative control groups disappear
This appears to work for me. Am I horribly wrong? I would like for another person to try this and give feedback please.
I’m thinking analyses should be performed with and without negative control removal, to see the potential effect it has, discuss the taxa that are “removed”, etc. Any thoughts?
What would be the advantage of removing negative control sequences before doing chimera searching and removal? Would it be better to remove chimeras first, and then remove negative control sequences?
I’m also very interested in this. I gave your protocol a try with my data, trying several different options:
- Removing all sequences present in any negative control / in all negative controls / something in between
- Removing all sequences whose relative abundance in the samples wasn’t at least twice that in the neg controls
I tried both the precluster stage and after chimera removal but didn’t yet directly compare these (should I expect any difference other than perhaps processing time?).
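For concreteness, here is roughly how the second option could be implemented (a Python sketch with invented counts; pooling all samples and all controls into single totals is a simplification of what I actually did):

```python
# Sketch of the "keep only if at least 2x as abundant in samples as in
# negative controls, in relative terms" rule. Names and counts are made up.

def flag_contaminants(sample_counts, neg_counts, fold=2.0):
    """Flag sequences whose relative abundance in the real samples is less
    than `fold` times their relative abundance in the negative controls."""
    sample_total = sum(sample_counts.values())
    neg_total = sum(neg_counts.values())
    flagged = set()
    for seq, neg_n in neg_counts.items():
        neg_ra = neg_n / neg_total
        sample_ra = sample_counts.get(seq, 0) / sample_total
        if sample_ra < fold * neg_ra:
            flagged.add(seq)
    return flagged

samples = {"Seq_A": 750, "Seq_B": 50, "Seq_C": 200}   # pooled totals: 1000
negs    = {"Seq_B": 40, "Seq_C": 5, "Seq_D": 55}      # pooled totals: 100

bad = flag_contaminants(samples, negs)
# Seq_B and Seq_D are flagged; Seq_C (0.20 vs 0.05) clears the 2x bar
```

Only sequences that actually appear in a negative control can ever be flagged, which is what distinguishes this from the purely statistical screening discussed above.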
At a quick glance this seems at least much better than screening at the OTU level, which I tried previously (and which was very obviously far too aggressive). And it seems (again at very first glance) fairly robust, in the sense that the different options above generated fairly similar results. But I need to look a bit more carefully to confirm this.
I have now had a closer look at the contamination screening trial and I’m still pretty happy with it.
Removing sequences present in the negative controls at the pre-cluster or chimera phase deletes roughly 50% of the data in my very-low-biomass samples, including obvious reagent contaminants (such as thermophilic bacteria that should not be there) and, of course, probably also some real sample microbes. Most of my high-biomass samples are not much affected.
Obviously there will be false negatives due to low-level cross-contamination, or to the same species really existing in both the samples and the controls (based on 16S qPCR, I expected something like 25% of the data to be contamination).
But can you think of any reason why this could still allow false positives?
I tried both the precluster stage and after removing chimeras and very rare sequences. In my case, it seems better to do this AFTER chimera removal: if I do it before, I end up with a number of OTUs that can’t be classified at the phylum level, so I guess more chimeras remain undetected.
I compared deleting everything present in any negative control (maybe a bit too strict) with deleting everything EXCEPT sequences at least twice as abundant in the samples as in the controls (in relative terms). These produced pretty similar results. I also tried deleting only those sequences present in ALL negative controls, but that is too relaxed (it still left almost 50% of the negative control data intact).