Hi,
I noticed that even after all the denoising-cleaning-trimming-chopping-chimera-slaying-younameit, some bad (really bad) sequences still manage to slip through all the filters.
There aren't many, maybe a couple of dozen for half a plate of 454, but they can still bias downstream analyses, particularly phylogeny-based indices, because these bad sequences form really long branches in the trees.
Here’s how I get rid of them:
Right after generating the ‘final.xxx’ files at the end of ‘Reducing sequencing error’:
- Generate an approximate ML tree using FastTree (this takes only a few minutes; see http://microbesonline.org/fasttree/)
system(FastTree -gtr -nt < final.fasta > final.ml.tre)
- Create an arb db with final.fasta and import the tree you just generated
- Switch to radial tree view; the bad sequences, if there are any, will stick out as really long branches
- Mark those long branches and export them as fasta (you might also want to export their accession numbers as .nds). I call these suspected.seqs.fasta and suspected.seqs.nds
- Blast those sequences. I run a remote blast from a locally installed BLAST+
system(blastn -task blastn -remote -db nr -query suspected.seqs.fasta -evalue 0.00001 -dust no -max_target_seqs 1 -html -out output.blastn.html)
- Manually inspect the blast results for obvious chimeras and other bad sequences (really low match, very bad alignments etc.)
- Make a list of their accession numbers (or, more precisely, keep them in suspected.seqs.nds and delete all the good ones) and save it as validated.bad.seqs.nds
- Run:
remove.seqs(accnos=validated.bad.seqs.nds, fasta=final.fasta)
remove.seqs(accnos=validated.bad.seqs.nds, name=final.names)
remove.seqs(accnos=validated.bad.seqs.nds, group=final.groups)
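
If you want a rough pre-screen before opening the tree in arb, the long-branch hunt can be sketched in Python. This is only a sketch under my own assumptions: the function name long_tips and the cutoff of 10x the median tip length are made up for illustration, it only looks at terminal branches (so a long internal branch leading to a whole bad clade will be missed), and it assumes plain unquoted sequence names in the Newick file. It does not replace looking at the tree and the blast results.

```python
import re
from statistics import median

# A leaf entry looks like "name:length" and follows "(" or ",".
# Internal-node labels (e.g. support values) follow ")" and are skipped.
TIP = re.compile(r"[(,]\s*([^():,;\s]+):([0-9.eE+-]+)")

def long_tips(newick, factor=10.0):
    """Return (name, length) pairs for tips whose terminal branch is
    longer than factor * median tip length, longest first.

    'factor' is an arbitrary illustrative cutoff, not a recommended value.
    If the median tip length is 0, every tip with a non-zero branch is
    returned, so check the output by eye.
    """
    tips = [(name, float(length)) for name, length in TIP.findall(newick)]
    if not tips:
        return []
    cutoff = factor * median(length for _, length in tips)
    flagged = [tip for tip in tips if tip[1] > cutoff]
    return sorted(flagged, key=lambda tip: tip[1], reverse=True)
```

Calling long_tips(open("final.ml.tre").read()) on the FastTree output would give a candidate list of names that could be written one per line to suspected.seqs.nds and then checked by blast as described above.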
Done!
Roey