Suggested new step for the Schloss SOP

Hi,
I noticed that even after all the denoising-cleaning-trimming-chopping-chimera-slaying-younameit there are still some bad (really bad) sequences that manage to slip through all the filters.
There aren’t many of them, maybe a couple of dozen for half a plate of 454, but they can still cause biases downstream, particularly with phylometric indices, because these bad sequences form really long branches in the trees.
Here’s how I get rid of them:
Right after generating the ‘final.xxx’ files at the end of ‘Reducing sequencing error’:

  1. Generate an approximate ML tree using FastTree (this takes only a few minutes; http://microbesonline.org/fasttree/):
system(FastTree -gtr -nt < final.fasta > final.ml.tre)
  2. Create an ARB database with final.fasta and import the tree you just generated.
  3. Switch to radial tree view; the bad sequences, if there are any, will stick out as really long branches.
  4. Mark those long branches and export them as fasta (you might also want to export their accession numbers as .nds).
    I call these suspected.seqs.fasta and suspected.seqs.nds.
  5. Blast those sequences. I run a remote blast using a locally installed BLAST+:
system(blastn -task blastn -remote -db nr -query suspected.seqs.fasta -evalue 0.00001 -dust no -max_target_seqs 1 -html -out output.blastn.html)
  6. Manually inspect the blast results for obvious chimeras and other bad sequences (really low match, very bad alignments, etc.).
  7. Make a list of their accession numbers (or, more precisely, keep them in suspected.seqs.nds and delete all the good ones). The commands below assume this list is saved as validated.bad.seqs.nds.
  8. Run:
remove.seqs(accnos=validated.bad.seqs.nds, fasta=final.fasta)
remove.seqs(accnos=validated.bad.seqs.nds, name=final.names)
remove.seqs(accnos=validated.bad.seqs.nds, group=final.groups)
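
For anyone who would rather not eyeball the radial tree, the "spot the long branches" part could probably be approximated with a small script. This is only a rough sketch and not part of the SOP: the function names, the output file name and the cutoff of 10 median absolute deviations above the median are all my own assumptions, and it looks only at terminal branch lengths, so a bad clade hanging off a long internal branch would be missed. It also assumes sequence names are unquoted and contain no whitespace. The flagged sequences would still need the blast check before removal.

```python
import re
import statistics

# A leaf is written as "name:length" right after "(" or ",".
# Internal nodes come right after ")" (FastTree puts support values there),
# so this pattern skips them.
LEAF_RE = re.compile(r"[(,]\s*([^(),:;\s]+):([0-9.eE+-]+)")


def terminal_branch_lengths(newick):
    """Return {sequence name: terminal branch length} from a Newick string."""
    return {name: float(length) for name, length in LEAF_RE.findall(newick)}


def flag_long_branches(lengths, n_mads=10.0):
    """Return names whose terminal branch exceeds median + n_mads * MAD.

    n_mads is an arbitrary starting point (an assumption), not a tested value.
    If the MAD is zero, every branch longer than the median gets flagged.
    """
    values = list(lengths.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    cutoff = med + n_mads * mad
    return sorted(name for name, v in lengths.items() if v > cutoff)


def write_suspects(tree_path, out_path, n_mads=10.0):
    """Read a Newick tree file and write one suspect name per line."""
    with open(tree_path) as handle:
        lengths = terminal_branch_lengths(handle.read())
    suspects = flag_long_branches(lengths, n_mads)
    with open(out_path, "w") as out:
        out.write("".join(name + "\n" for name in suspects))
    return suspects
```

Something like write_suspects("final.ml.tre", "suspected.seqs.nds") would then give a candidate list with one name per line, which is the layout mothur expects for an accnos file. The cutoff will likely need tuning per dataset.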

Done!

Roey

Roey, what are the sequences? I suspect they have low bootstrap support to “Bacteria” or are mitochondria/chloroplasts.

Pat

Hi Pat,
Some were obvious chimeras, others were just bad-quality sequences (bad enough to show low-quality alignments in blast, mostly due to homopolymers).
Hard to generalize since these were really just a few (I had 35 sequences out of over 120,000).
These probably don’t affect too many downstream calculations; my only concern was a serious bias in phylometric calculations (not tested, though).

Roey