Flows algorithm

I am comparing mothur with other pipelines, just to see the differences.

One of the other pipelines does the normal trimming of barcodes and primers, checks the length of sequences, checks the quality of each base, etc. (all the normal stuff that you would expect, nothing algorithmically intensive). This removes roughly 25-30% of my sequences.
However, when I compare this to trim.flows --> shhh.flows, many more of my sequences are removed (roughly 90% of my sequences).

Why is this?


If you look at the sequence names in the scrap file you will see a | and then single letter codes that indicate why a sequence was scrapped. See the wiki page for trim.flows for a description. Without knowing how you ran trim.flows or what type of data you are sequencing, it’s hard to know what’s going on. If you’re on a unix/mac box you can run the following to see why sequences are getting chucked…

cut -f 1 -d " " *scrap.flow | cut -f 2 -d "|" | sort | uniq -c

A b, f, or l in the output indicates a mismatch to the barcode, a mismatch to the forward primer, or a problem with the length, respectively.
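If you are not on a unix/mac box, the same tally can be sketched in Python. This is only a rough equivalent of the command above, not part of mothur: the function name tally_scrap_codes is made up for illustration, and it assumes that the scrap code is whatever follows the first | in the first whitespace-separated field of each line. Unlike the cut pipeline, it skips lines that have no | in that field, and it also counts each single-letter code separately.

```python
from collections import Counter


def tally_scrap_codes(lines):
    """Count the scrap codes that follow the '|' in each sequence name."""
    by_code = Counter()    # whole code string as written, e.g. "bf"
    by_letter = Counter()  # each single-letter code counted on its own
    for line in lines:
        fields = line.split()
        if not fields or "|" not in fields[0]:
            continue  # blank line, or a line whose first field has no code
        code = fields[0].split("|")[1]
        by_code[code] += 1
        by_letter.update(code)
    return by_code, by_letter


if __name__ == "__main__":
    import glob

    # Same file pattern as the shell command above.
    for path in glob.glob("*scrap.flow"):
        with open(path) as handle:
            by_code, by_letter = tally_scrap_codes(handle)
        print(path)
        for code, count in by_code.most_common():
            print(f"{count:>8} {code}")
```

Run it from the directory that holds the scrap file. The by_letter counts are useful when many sequences carry more than one code, since a combined code such as bf would otherwise hide how often each individual reason occurs.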


What about shhh.flows? How does it check the sequence to determine if something is noise or not?

You might want to check out the following:

Schloss PD, Gevers D, Westcott SL (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS ONE 6:e27310.
Quince C, Lanzén A, Davenport RJ, Turnbaugh PJ (2011). Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12:38.
Quince C, Lanzén A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT (2009). Accurate determination of microbial diversity from 454 pyrosequencing data. Nat. Methods 6:639.

The Quince papers describe the PyroNoise algorithm that we cloned into C++ as shhh.flows.