sequences left after denoising and more

hi all,

some questions once more … :oops:

I’m working through some data distributed across 8 .sff files. These contain 52 samples, with forward and reverse sequences mixed randomly.
I need to extract the 9 samples our lab is working on and, as you suggested in an earlier thread, analyse them separately (F and R).

When setting minflows and maxflows to 0 and 800, respectively, I assume I get the full raw data. That gives about 55600 F sequences out of >700k (F+R), which is a reasonable number.
Using the default flow settings of 450, I retain only 27091 sequences; a pity, but it seems plausible.
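
To be concrete, the two trim.flows settings I’m comparing look roughly like this (the file and oligos names are just placeholders for my actual data):

mothur > sffinfo(sff=sample.sff, flow=T)
mothur > trim.flows(flow=sample.flow, oligos=sample.oligos, minflows=0, maxflows=800)
mothur > trim.flows(flow=sample.flow, oligos=sample.oligos)

The second call just uses the defaults, i.e. minflows=450 and maxflows=450.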

Subsequently subjecting them to shhh.flows dramatically reduces this number (to about 9000 for the default settings and 16000 for the “raw data”). OK, I assume this could be due to some really bad-quality data. (What is a normal reduction rate when denoising? The SOP seemed to show far less of a reduction in sequences.)
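
(For reference, by denoising I mean the standard call, with the same placeholder name as above:

mothur > shhh.flows(file=sample.flow.files)

where sample.flow.files is the .files output from trim.flows.)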

The problem, however, is that this pyro data was already analysed by another lab using Pyro/AmpliconNoise, some custom scripts and BWA. Because I was a little suspicious about the quality, and since we need to be able to process this kind of data ourselves, I wanted to re-analyse it with mothur. However, the original analysis retained 55200 F sequences after AmpliconNoise. How is this discrepancy possible if the same algorithm is used?

This eventually leaves me with about 700-1000 sequences/sample using mothur, versus 1500-3000 with the original analysis, after further removing low-quality sequences (too short, …).


Secondly, when creating OTUs, my number of sequences increases again, up to 20k+. Now, during the pre-analysis I renamed some files a couple of times. I read somewhere that mothur somehow "remembers" links to previous files? Is this always the case, or is it reset every time you quit mothur? In other words, can you just rename files and then start mothur again to continue with the new names?
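(To be clear, I mean the "current" file tracking that you can inspect and override, with made-up names, something like:

mothur > get.current()
mothur > set.current(fasta=renamed.fasta, name=renamed.names)

i.e. whether I need to reset these manually after renaming, or whether a fresh mothur session simply forgets them.)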
Thanks

Just to get things straight: after trim.flows you retain a certain number of sequences. After shhh.flows this number is reduced, but the shhh.names file still contains the number of sequences (represented by their IDs) that you had after trim.flows. Are these clustered as unique sequences, so that the names-file total really is the number of sequences you actually work with / that are used in the eventual analysis?

:idea: please enlighten me :idea:

:oops:

Kirk,

> Subsequently subjecting them to shhh.flows dramatically reduces this number (to about 9000 for the default settings and 16000 for the “raw data”). OK, I assume this could be due to some really bad-quality data. (What is a normal reduction rate when denoising? The SOP seemed to show far less of a reduction in sequences.)
>
> This eventually leaves me with about 700-1000 sequences/sample using mothur, versus 1500-3000 with the original analysis, after further removing low-quality sequences (too short, …).

shhh.flows outputs the idealized sequences after the correction process. The duplicate sequence names are kept in the name file, which is why things look more normal once you create the OTUs. If you run summary.seqs(fasta=, name=), the data from those duplicate sequences will be incorporated into the counts.

A big difference between what we’re doing/advising and what the original package does is that trim.flows trims the flowgrams to 450 flows and throws out anything with fewer flows. In contrast, the original method keeps anything with between 360 and 720 flows. If you look at our recent PLoS ONE paper you’ll see that the original method has a high error rate while our approach has a much lower one, because most of the errors occur after 450 flows. Remember that errors will generate different sequences, so you’ll get more “unique” sequences with the original approach.
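
For the summary.seqs check above, with placeholder names for the shhh.flows output, that would be something like:

mothur > summary.seqs(fasta=sample.shhh.fasta, name=sample.shhh.names)

The total it reports should include every read represented in the name file, not just the unique/idealized sequences.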

Hope this makes sense…
Pat

Thanks!