Hello-
I am trying to remove contaminating sequences from my 454 dataset. We sequenced our negative control and I would like to remove the sequences that are found in that sample from the rest of my samples. In previous studies, I created a blast database from my samples and then queried it using a contaminant.fasta file. Then I removed those sequences from the fasta file using a custom perl script. I am wondering how I might do this using the available commands in mothur.
I am starting with a trimmed fasta file that contains only unique sequences for the total dataset called CEdata.trim.unique.fasta and a corresponding CEdata.trim.names and CEdata.groups file.
summary.seqs(fasta=CEdata.trim.unique.fasta)
Start End NBases Ambigs Polymer Minimum: 1 200 200 0 3 2.5%-tile: 1 210 210 0 4 25%-tile: 1 225 225 0 5 Median: 1 230 230 0 5 75%-tile: 1 236 236 0 5 97.5%-tile: 1 246 246 0 6 Maximum: 1 270 270 0 6 # of Seqs: 7041 <-UNIQUE SEQS
I then use the split.groups command to create fasta files for each individual sample (30 samples).
Output File Names:
CEdata.trim.unique.ZSW5_2.fasta
CEdata.trim.unique.ZSW4_1.fasta
CEdata.trim.unique.ZSW4_2.fasta
CEdata.trim.unique.ZSW3_2.fasta
CEdata.trim.unique.ZSW5_1.fasta
CEdata.trim.unique.ZSW3_1.fasta
CEdata.trim.unique.CSW4_2.fasta
CEdata.trim.unique.HSW9_2.fasta
CEdata.trim.unique.HSW4_2.fasta
CEdata.trim.unique.HSW7_1.fasta
CEdata.trim.unique.HSW9_1.fasta
CEdata.trim.unique.HSW5_2.fasta
CEdata.trim.unique.HEM.fasta
CEdata.trim.unique.CSW6_1.fasta
CEdata.trim.unique.CSW7_1.fasta
CEdata.trim.unique.CSW8_2.fasta
CEdata.trim.unique.CSW10_2.fasta
CEdata.trim.unique.CSW3_1.fasta
CEdata.trim.unique.HSW5_1.fasta
CEdata.trim.unique.ZSWLB.fasta
CEdata.trim.unique.CSWLB.fasta
CEdata.trim.unique.ZEM.fasta
CEdata.trim.unique.HSWLB.fasta
CEdata.trim.unique.CEM.fasta
CEdata.trim.unique.RSW5_1.fasta
CEdata.trim.unique.COP50_1.fasta
CEdata.trim.unique.ZOP50_1.fasta
CEdata.trim.unique.HOP50_2.fasta
CEdata.trim.unique.RSW9_1.fasta
CEdata.trim.unique.RSW4_1.fasta
I would like to remove all sequences found in HSWLB from all samples beginning with the H identifier. I would like to do the same with CSWLB and C identifier and ZSWLB and the Z identifier. At this point I think I would like to use the classify.seqs command to classify the sequences in HSWLB, CSWLB and ZSWLB (using Bayesian or blast). However, here is where I run into the problem. The classification is just a taxonomic assignment with a certain level of confidence. If a sequence in CSWLB is assigned to Bacteriodetes with 100% confidence is it right to assume that I ought to remove all Bacteriodetes from samples with C identifier? It seems to me that there might be other Bacteriodetes that are in those samples that are not 100% identical to the sequence in CSWLB but will still be assigned to Bacteriodetes with 100% confidence. That is why I think that blast might be a better option. With blast, I can use perl scripting to remove sequences in C identifier samples that are 97% or higher identical to any sequence found in CSWLB over 200 nt. Is it possible to do this using the classify.seqs blast option? I basically need to be comparing sequences base to base, not taxonomic classification to taxonomic classification.
Thank you,
Kris