Is there any way to remove entire rows from a *.names file? remove.seqs only removes the specified accession numbers, leaving other identical sequences in place. I ran the chimera check on my *.unique.fasta file and found more than 2000 chimeric sequences, so I want to remove those, plus anything identical to one of them, from my names file… and I don’t want to do this manually!
I started with 28746 unique sequences, and a names file with an equal number of lines.
All.pick.good.filter.filter.unique.fasta - 28746 seqs
All.pick.good.filter.filter.names - 28746 lines; 254640 accnos
The chimera check found 2273 chimeric sequences among those 28746.
CHIMERA CHECK
chimera.seqs(fasta=All.pick.good.filter.filter.unique.fasta, template=silva.filter.filter.fasta, method=pintail, processors=2)
See silva_chimeras.xls; 2273 chimeric seqs
Remove putative chimeric sequences
remove.seqs(accnos=silva_chimeras, fasta=All.pick.good.filter.filter.unique.fasta)
remove.seqs(accnos=silva_chimeras, name=All.pick.good.filter.filter.names)
remove.seqs(accnos=silva_chimeras, group=sample.pick.good.group)
All.pick.good.filter.filter.unique.pick.fasta - 26473 seqs
All.pick.good.filter.filter.pick.names - 27276 lines; 252367 accnos
sample.pick.good.pick.group - 252367 seqs
After running remove.seqs, I have more lines in my *.pick.names file (27276) than sequences in my *.unique.pick.fasta (26473), so 803 rows in the names file no longer have a matching sequence in the fasta file. This is because remove.seqs only takes away the first accession number in the row and leaves the rest (EVEN THOUGH ALL ACCESSION NUMBERS IN A ROW HAVE IDENTICAL SEQUENCES). So in this case I haven’t actually removed all of the chimeric sequences from the dataset.
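To make clear what I mean by removing entire rows, here is a rough Python sketch of the filtering I am after. It is not part of mothur. It assumes the names file has two whitespace-separated columns (the representative name, then a comma-separated list of every identical sequence), and the function name `drop_rows` and the example sequence names are made up for illustration.

```python
def drop_rows(names_lines, chimeric):
    """Drop every names-file row that contains a chimeric accession.

    Returns the rows to keep, plus the set of ALL names found on dropped
    rows (the flagged sequence and everything identical to it).
    """
    kept, dropped = [], set()
    for line in names_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: representative <whitespace> name1,name2,...
        rep, members = line.split(None, 1)
        names = members.split(",")
        if rep in chimeric or any(n in chimeric for n in names):
            dropped.add(rep)
            dropped.update(names)
        else:
            kept.append(line)
    return kept, dropped


# Tiny made-up example: seqA and seqD were flagged as chimeric.
rows = ["seqA\tseqA,seqB,seqC", "seqD\tseqD", "seqE\tseqE,seqF"]
kept, dropped = drop_rows(rows, {"seqA", "seqD"})
print(kept)             # ['seqE\tseqE,seqF']
print(sorted(dropped))  # ['seqA', 'seqB', 'seqC', 'seqD']
```

Writing the `dropped` names out one per line should give an expanded accession list that could then be passed to remove.seqs for the group file as well, so that the fasta, names, and group files stay in step.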