Removing contaminants

kfontanez · January 14, 2011, 7:28pm

Hello-

I am trying to remove contaminating sequences from my 454 dataset. We sequenced our negative control and I would like to remove the sequences that are found in that sample from the rest of my samples. In previous studies, I created a blast database from my samples and then queried it using a contaminant.fasta file. Then I removed those sequences from the fasta file using a custom perl script. I am wondering how I might do this using the available commands in mothur.

I am starting with a trimmed fasta file that contains only unique sequences for the total dataset called CEdata.trim.unique.fasta and a corresponding CEdata.trim.names and CEdata.groups file.

summary.seqs(fasta=CEdata.trim.unique.fasta)

Start End NBases Ambigs Polymer Minimum: 1 200 200 0 3 2.5%-tile: 1 210 210 0 4 25%-tile: 1 225 225 0 5 Median: 1 230 230 0 5 75%-tile: 1 236 236 0 5 97.5%-tile: 1 246 246 0 6 Maximum: 1 270 270 0 6 # of Seqs: 7041 <-UNIQUE SEQS

I then use the split.groups command to create fasta files for each individual sample (30 samples).
Output File Names:
CEdata.trim.unique.ZSW5_2.fasta
CEdata.trim.unique.ZSW4_1.fasta
CEdata.trim.unique.ZSW4_2.fasta
CEdata.trim.unique.ZSW3_2.fasta
CEdata.trim.unique.ZSW5_1.fasta
CEdata.trim.unique.ZSW3_1.fasta
CEdata.trim.unique.CSW4_2.fasta
CEdata.trim.unique.HSW9_2.fasta
CEdata.trim.unique.HSW4_2.fasta
CEdata.trim.unique.HSW7_1.fasta
CEdata.trim.unique.HSW9_1.fasta
CEdata.trim.unique.HSW5_2.fasta
CEdata.trim.unique.HEM.fasta
CEdata.trim.unique.CSW6_1.fasta
CEdata.trim.unique.CSW7_1.fasta
CEdata.trim.unique.CSW8_2.fasta
CEdata.trim.unique.CSW10_2.fasta
CEdata.trim.unique.CSW3_1.fasta
CEdata.trim.unique.HSW5_1.fasta
CEdata.trim.unique.ZSWLB.fasta
CEdata.trim.unique.CSWLB.fasta
CEdata.trim.unique.ZEM.fasta
CEdata.trim.unique.HSWLB.fasta
CEdata.trim.unique.CEM.fasta
CEdata.trim.unique.RSW5_1.fasta
CEdata.trim.unique.COP50_1.fasta
CEdata.trim.unique.ZOP50_1.fasta
CEdata.trim.unique.HOP50_2.fasta
CEdata.trim.unique.RSW9_1.fasta
CEdata.trim.unique.RSW4_1.fasta

I would like to remove all sequences found in HSWLB from all samples beginning with the H identifier. I would like to do the same with CSWLB and C identifier and ZSWLB and the Z identifier. At this point I think I would like to use the classify.seqs command to classify the sequences in HSWLB, CSWLB and ZSWLB (using Bayesian or blast). However, here is where I run into the problem. The classification is just a taxonomic assignment with a certain level of confidence. If a sequence in CSWLB is assigned to Bacteriodetes with 100% confidence is it right to assume that I ought to remove all Bacteriodetes from samples with C identifier? It seems to me that there might be other Bacteriodetes that are in those samples that are not 100% identical to the sequence in CSWLB but will still be assigned to Bacteriodetes with 100% confidence. That is why I think that blast might be a better option. With blast, I can use perl scripting to remove sequences in C identifier samples that are 97% or higher identical to any sequence found in CSWLB over 200 nt. Is it possible to do this using the classify.seqs blast option? I basically need to be comparing sequences base to base, not taxonomic classification to taxonomic classification.

Thank you,
Kris

pschloss · January 17, 2011, 12:32pm

Kris,

Great question. Unfortunately, as you’ve gathered, it’s not possible to easily do what you want. First thing - keep in mind that 100% confidence doesn’t necessarily mean 100% identity. I think the taxonomic scheme you’ve outlined is not what you really want to do. Here’s one option [keep in mind that the exact syntax might be a bit off]…

Take all of the HSW fasta samples and merge them together (cat CEdata.trim.unique.HSW*.fasta > HSW.fasta)
Run unique.seqs on HSW.fasta
list.seqs(fasta=CEdata.trim.unique.HSWLB.fasta)
Run remove.seqs(fasta=HSW.unique.fasta, accnos=CEdata.trim.unique.HSWLB.accnos, name=HSW.names)
Proceed as normal…

Alternatively, if you want to remove any OTU that has sequences more than 3% similar to the sequences in HSWLB you could do the following…

Using all of the sequences, generate a shared file
Scan across the 0.03-labelled line in the shared file and identify any column containing sequences from HSWLB
Remove those columns using excel
Decrement the number of OTUs in the shared file by the number of columns you removed.

This final option might be preferred since it would give you the option of customizing your contamination removal a bit better. For example, you might expect a contaminant to show up in every HSW sample, or you might expect it to be relatively rare, etc.

Any other ideas people have?

kfontanez · January 23, 2011, 10:53pm

Pat,

Thanks for your helpful information. The second method you proposed works well for me.

Best,
Kris

johnk · March 26, 2015, 9:01pm

I know this is an old post, but it was super useful!! I think this is a common (and increasingly important) question. Thank you Pat~

Topic		Replies	Views
very naive question on fasta file Commands in mothur	1	2131	March 8, 2013
Memory error with remove.seqs Commands in mothur	1	744	February 10, 2017
Removing (contaminating) sequences AFTER OTU clustering Commands in mothur	1	850	May 18, 2017
41 sample sff files, with barcodes removed! How to proceed Commands in mothur	1	1031	July 7, 2016
Remove.lineage output file Commands in mothur	4	854	July 13, 2019

Removing contaminants

Related topics