Pat Schloss mentioned PyroNoise in the “Announcements” section, and I thought I would start a thread here.
The reference is: Quince et al. Nature Methods 6, 639–641 (1 September 2009) | doi:10.1038/nmeth.1361
This software purports to ‘denoise’ pyrosequencing data, and implies that much of the microbial diversity uncovered by second-generation sequencing is due to errors (mis-read bases, PCR errors, etc.). I have been working with this program on my own 454 dataset, and I have significant uncertainty about exactly what it is doing and whether that is biologically appropriate.
Has anyone else been experimenting with this new program?
Some of the steps require MPI (message passing interface). My institution first told me that I would need to get on one of our supercomputers in order to get this capability, but it did end up working on my MacBook Pro, with a few modifications (reported here: http://seqanswers.com/forums/showthread.php?t=3588)
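For anyone else checking whether MPI will run on a laptop before chasing down cluster access, a quick sanity check helps. This little script assumes mpi4py is installed and has nothing to do with PyroNoise itself; it just confirms that mpirun can launch multiple ranks:

    # save as mpi_check.py, then run: mpirun -np 2 python mpi_check.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD  # default communicator spanning all launched ranks
    print(f"rank {comm.Get_rank()} of {comm.Get_size()} is alive")

If that prints one line per rank, the MPI layer itself is working and any remaining problems are in the PyroNoise build.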
I don’t claim to be an expert, but I’d be interested in hearing more about your concerns. My initial thought is that this is more of a technical problem than a biological one. Let me explain. PyroNoise is basically signal-processing software: it takes ambiguous flow data and fits a model to make the data less ambiguous. I guess the biological issues come in when what is perceived as noise isn’t really noise, but biology. Ultimately, tools like this and the others that are sure to follow will be tested against mock communities that have been sequenced ad nauseam. The risk there seems to be over-training the methods and then removing real variation when analyzing real data. Like you, I’d be very interested in hearing what others think.
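To make the signal-processing framing concrete, here is a toy sketch of the general idea, not the actual PyroNoise model (which, per the paper, fits a mixture distribution to the flow signals): treat each read as a vector of flow intensities, group reads that sit close together in flow space, and report one centroid per group. The threshold and data below are invented purely for illustration:

    import numpy as np

    def denoise_flowgrams(flows, threshold=2.0):
        """Toy greedy clustering in flow space: each flowgram joins the
        first cluster whose centroid is within `threshold` (Euclidean);
        otherwise it seeds a new cluster. Returns one centroid per cluster."""
        centroids, members = [], []
        for f in flows:
            for i, c in enumerate(centroids):
                if np.linalg.norm(f - c) < threshold:
                    members[i].append(f)
                    centroids[i] = np.mean(members[i], axis=0)  # update centroid
                    break
            else:
                centroids.append(f.copy())
                members.append([f])
        return centroids

    # two 'true' templates plus noise; four flow values stand in for
    # the hundreds of flows in a real flowgram
    rng = np.random.default_rng(0)
    true_a = np.array([1.0, 0.0, 2.0, 1.0])
    true_b = np.array([0.0, 1.0, 1.0, 3.0])
    reads = [t + rng.normal(0, 0.2, 4) for t in [true_a] * 5 + [true_b] * 5]
    print(len(denoise_flowgrams(reads)))  # ideally 2: noise collapsed to signal

Your worry then translates to: if two genuinely distinct templates sit closer in flow space than the threshold (or the fitted noise width), they get merged into one centroid and the biology is gone.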
It is a little subtle, but from what I’ve read, it seems that the PyroNoise algorithm mainly targets homopolymers and indels rather than substitutions (mis-called bases). Calling homopolymers and indels is where pyrosequencing has the most trouble, right?
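For anyone who hasn’t stared at raw flow data: the base caller sees one analogue intensity per flow, and the homopolymer length is essentially the rounded intensity. Because the signal noise grows with run length, distinguishing a 5-mer from a 6-mer is much harder than a 1-mer from a 2-mer. A toy simulation (the noise model is made up, not the real 454 calibration):

    import random

    def call_homopolymer(true_len, noise_per_base=0.12):
        """Simulate one flow: signal scales with homopolymer length, and so
        does the noise, so rounding misses more often at longer runs."""
        signal = true_len + random.gauss(0, noise_per_base * max(true_len, 1))
        return max(round(signal), 0)

    random.seed(1)
    for n in (1, 3, 6):
        trials = [call_homopolymer(n) for _ in range(10000)]
        err = sum(t != n for t in trials) / len(trials)
        print(f"true length {n}: miscall rate ~{err:.1%}")

With these invented numbers the miscall rate climbs from essentially zero at 1-mers to tens of percent at 6-mers, which matches the intuition that homopolymers are exactly where pyrosequencing struggles.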
After trying CD-HIT and Replicates, PyroNoise seems to be one of the more promising data pre-filtering programs out there, but there is a pragmatic issue. After reading the instructions for larger datasets, it doesn’t seem like the program is optimized for the dataset sizes typical of pyrosequencing. This is from the PyroNoise website (http://people.civil.gla.ac.uk/~quince/Software/PyroNoise.html):
"PCluster can currently handle a maximum of 10,000 flowgrams. Larger data sets have potential numerical error overflow problems and exhaust typically available memory (~8GB). To deal with larger data sets two not exclusive strategies can be employed. The first is an initial clustering of flowgrams based on sequence…
There are some pretty intense pre-steps in using PyroNoise for large datasets, and I’m wondering if they will cause the program to lose some sensitivity and effectiveness. What are the chances of making this more “large-dataset” friendly? Are you thinking about incorporating this into mothur?
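For what it’s worth, the sequence-based pre-clustering the website mentions can be approximated with a crude split before ever invoking PCluster, keeping each chunk under the 10,000-flowgram ceiling. A rough sketch; the prefix rule and chunk packing here are my own placeholders, not the published pipeline:

    def split_by_prefix(records, prefix_len=50, max_chunk=10000):
        """Crudely pre-cluster (name, seq) reads by their first `prefix_len`
        bases, then pack the prefix groups into chunks no larger than
        `max_chunk`, so each chunk can be fed to PCluster separately."""
        groups = {}
        for name, seq in records:
            groups.setdefault(seq[:prefix_len], []).append((name, seq))

        chunks, current = [], []
        for group in sorted(groups.values(), key=len, reverse=True):
            # note: a single prefix group larger than max_chunk would
            # still need further splitting, which is ignored here
            if current and len(current) + len(group) > max_chunk:
                chunks.append(current)
                current = []
            current.extend(group)
        if current:
            chunks.append(current)
        return chunks

My worry is exactly the sensitivity question above: reads whose true template lands in a different chunk can never be merged with it.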
I wonder how much sample size affects what PyroNoise does, since the algorithm seems to take the initially observed variation into account. If you look at the actual output, you will see that PyroNoise does ‘correct’ quite a few single-nucleotide polymorphisms, even in regions without homopolymeric runs. I understand less about how the algorithm identifies these PCR errors than about how it detects errors in homopolymers. It seems to be basically removing the rarest variants… in which case, is the result any different from simply using a broader OTU definition? How does the error rate implied by PyroNoise compare to published error-rate estimates for the platform? How does ‘de-noising’ behave across variable versus conserved sites?
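On the broader-OTU question, the two operations are at least formally different: an abundance filter removes whole reads no matter how similar they are to anything else, while coarser clustering merges reads by distance no matter how rare they are. A toy comparison (sequences and thresholds invented for illustration):

    from collections import Counter

    def drop_rare(seqs, min_count=2):
        """Abundance filter: discard variants seen fewer than min_count times."""
        counts = Counter(seqs)
        return [s for s in seqs if counts[s] >= min_count]

    def n_otus(seqs, max_diffs=3):
        """Greedy OTU picking: join the first existing OTU within max_diffs
        mismatches (sequences assumed pre-aligned and equal length)."""
        otus = []
        for s in set(seqs):
            for rep in otus:
                if sum(a != b for a, b in zip(rep, s)) <= max_diffs:
                    break
            else:
                otus.append(s)
        return len(otus)

    reads = ["ACGTACGT"] * 50 + ["ACGTACGA"] * 1 + ["TTTTACGT"] * 1
    print(n_otus(drop_rare(reads)))    # filter removes near and far singletons alike
    print(n_otus(reads, max_diffs=3))  # broader OTUs merge only the near one

Here the filter also discards the rare-but-divergent read that broader clustering would have kept as its own OTU, and those divergent rarities are exactly the interesting ones. Whether PyroNoise behaves more like one operation or the other seems like an empirical question.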
I am exploring some of these questions at the moment.
A couple of major limitations that I see in the program: first, all of the accession numbers are changed in the output, which makes it difficult to compare PyroNoise’s processing against other programs. Second, the criteria for screening out low-quality sequences are neither clearly documented nor adjustable by the user. For example, if I do initial processing and quality control in mothur, then select the resulting accession numbers from my original data and run those through PyroNoise, additional sequences get dropped. The only adjustable quality settings in PyroNoise seem to be a match to the forward primer and a minimum sequence length; since I have already applied both criteria in mothur, I’m not sure on what basis the additional sequences are filtered out. On the other hand, there are a couple of adjustable parameters (sigma, c), but little guidance on how to set them. Changing them alters the algorithm’s output quite a bit, yet the paper only states that the default parameter values “gave good results” with the authors’ dataset.
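For the accession problem, one workaround I can imagine is mapping the PyroNoise output back to the original accessions by exact sequence match, so runs can at least be compared read-for-read with other pipelines. The file names below are hypothetical, and note the obvious caveat: reads whose sequence PyroNoise actually corrected will no longer match exactly and would need alignment-based matching instead:

    def read_fasta(path):
        """Minimal FASTA reader: yields (header, sequence) pairs."""
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq)
                    name, seq = line[1:], []
                else:
                    seq.append(line)
            if name is not None:
                yield name, "".join(seq)

    # hypothetical file names; adjust to your own run
    originals = {seq: name for name, seq in read_fasta("raw_reads.fasta")}
    for new_name, seq in read_fasta("pyronoise_out.fasta"):
        # exact match only recovers reads PyroNoise left untouched
        print(new_name, "->", originals.get(seq, "corrected/no exact match"))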