trim.seqs (Threshold vs Average)

Hi MOTHUR gurus,

I need some guidance on using qaverage vs. qthreshold.

When I use qaverage, changing the stringency from 15 to 35, the only change is in the number of sequences retained; the % unique within the retained sequences does not change much (from 94 to 85 as qaverage increases from 15 to 30). However, with qthreshold (10-29) I tend to lose many more sequences, and it does have an effect on % unique. Past a certain point, though, increasing the threshold above 15 (for my library) throws out sequences without a corresponding drop in unique sequences. In fact, the ratio of (% loss in sequences / % loss of uniques) seems to increase.

I would assume that once the library is “relatively clean” of sequencing errors, any stringency setting should affect all sequences equally rather than selectively removing the “non-uniques”. Is there a reason the ratio increases with increasing stringency after a certain point? Is it best to select the qthreshold at the point where the ratio of (% loss in sequences / % loss of uniques) crosses 1?
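For concreteness, here is a minimal sketch of how I am computing that diagnostic ratio. This is plain Python, not a mothur command, and all of the counts below are made up for illustration:

```python
# Tabulate the ratio (% loss in sequences / % loss of uniques) across
# trim.seqs runs at different qthreshold settings. Counts are hypothetical.

def loss_ratio(n0, u0, n1, u1):
    """Ratio of % loss in total sequences to % loss in uniques,
    relative to baseline counts n0 (total) and u0 (unique)."""
    seq_loss = 100.0 * (n0 - n1) / n0
    uniq_loss = 100.0 * (u0 - u1) / u0
    return seq_loss / uniq_loss if uniq_loss > 0 else float("inf")

n0, u0 = 6000, 5640  # baseline library before quality trimming (made up)
runs = [(10, 5800, 5450), (15, 5100, 4700), (20, 3900, 3600), (25, 2100, 1950)]
for q, n, u in runs:
    print(f"qthreshold={q}: ratio = {loss_ratio(n0, u0, n, u):.2f}")
```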

Thanks very much for your help,
Ameet

Hi,

One more question. Looking at the quality scores of my sequences, I notice a clear, almost linear decline in quality after about 240 nucleotides. Additionally, there are some sequences where the front end (sequencing end) has low quality scores. Does qthreshold, with the qtrim option set to True, trim both the front end (sequencing end) and the back end, or does the command only trim from the back end? When I try this command, the quality scores of the front end get better (above the specified threshold), and the back end of the sequence is also clearly better. Does the front end improve because whole sequences are tossed out, or does the command attempt to trim it as well? It would be helpful if qtrim trimmed the front end too, thus improving the retention rate of sequences. If this is not currently possible, would this feature be possible in the future? I can email you the quality score plots if you would like to examine them.

Note: All the sequences have been sorted using oligos where pdiffs=1, bdiffs=0.

Thanks,
Ameet

That is what we see as well. I’m working on the manuscript that describes how we deal with quality score trimming. As we outline in the Costello example, we use a sliding window of 50 bp, and when the average quality score in the window drops below 35 we trim the sequence. In addition to stringent chimera checking, this significantly drops the error rate.
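To illustrate the sliding-window idea, here is a minimal sketch. It is only an illustration of the logic described above, not mothur's actual implementation:

```python
# Truncate a read at the first 50 bp window whose average quality
# drops below 35 (window stepped one base at a time here).

def window_trim(seq, quals, window=50, min_avg=35):
    """Return seq truncated at the start of the first window whose
    average quality score falls below min_avg."""
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return seq[:start]   # cut at the start of the failing window
    return seq                   # whole read passes

# Example with made-up scores: quality decays toward the 3' end
quals = [38] * 200 + [30] * 100
seq = "A" * 300
print(len(window_trim(seq, quals)))  # read is cut where the average dips
```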

Is qual=35 over a window of 50 a very stringent approach? For a very small library of about 600 pyrotags (200-350 bp long), when I use this approach I am left with less than 10% of the reads in the trim file.

From my understanding, a Phred score of 20 has traditionally been used as an indicator of good quality for Sanger sequences. Is this correct? My assumption was that the scores in a 454 quality file follow the same convention as Sanger, where 20 would mean a 99% probability of an accurate base call and 35 about 99.97% accuracy. But using 35 vs. 20 results in a much greater loss of sequences. Is there a measure of how critical that extra fraction of a percent is? It will certainly reduce the number of OTUs (maybe simply by reducing the number of sequences), but in trying to avoid erring on the side of too high a diversity, am I biasing the analyses toward a lower diversity?
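For reference, the Phred convention I am assuming is Q = -10 * log10(p_error); a quick check of what Q20 and Q35 mean per base:

```python
# Per-base accuracy implied by a Phred quality score.
def accuracy(q):
    return 1.0 - 10 ** (-q / 10.0)

for q in (20, 35):
    print(f"Q{q}: {100 * accuracy(q):.3f}% accurate")
# Q20: 99.000% accurate
# Q35: 99.968% accurate
```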

Is it true that a large portion of the errors in pyrosequencing come from rounding off the measured light intensity, i.e. 3.2 might be rounded to 3, deleting a nucleotide that should in fact be present, or conversely inserting a nucleotide that is not present? Could trimming with either qaverage or qthreshold be applied with special weights that look for errors in windows containing more, or longer, homopolymers? For example, a score of 25 might be good enough in a region where the homopolymer content is below a certain %, while a 35 might be better applied in regions where the % of homopolymers in a window is above a certain threshold, or in regions with especially long homopolymers. Put another way, given the distribution of homopolymer frequency and length across all the sequences in a library, could a different quality cutoff (q=20 vs. 25 vs. 35) be applied to regions near the mean of that distribution vs. regions at its extremes? Is there a way to approach this, and is it a valid approach at all?
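To make the proposal concrete, here is a purely hypothetical sketch of a homopolymer-aware cutoff; as far as I know, nothing like this exists in trim.seqs, and the parameters (base_q, strict_q, max_homop) are invented for illustration:

```python
# Pick a per-window quality cutoff: stricter when the window contains
# a long homopolymer run. Entirely hypothetical, not a mothur feature.
import re

def window_cutoff(window_seq, base_q=25, strict_q=35, max_homop=4):
    """Return strict_q if window_seq contains a homopolymer longer
    than max_homop bases, otherwise base_q."""
    longest = max((len(m.group()) for m in re.finditer(r"(.)\1*", window_seq)),
                  default=0)
    return strict_q if longest > max_homop else base_q

print(window_cutoff("ACGTACGTAC"))  # 25: no long homopolymer
print(window_cutoff("ACGGGGGGTA"))  # 35: run of six Gs triggers strict cutoff
```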

How is the error rate calculated, and what metrics affect it?

Thanks very much for your help,
Ameet

I guess it depends on how you define stringent :slight_smile: Remember that q20 was developed for genome sequencing, where you can have some slop because you will have 12-fold coverage of each base. We have 1-fold coverage of each base, so it’s critical to have high standards. As for the short reads, we have been seeing this in recent cases, and it is typically ascribed to 454 having problems with the emulsion oil they’ve been using.
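To put rough numbers on that (back-of-the-envelope only, assuming a 250 bp read): with 1-fold coverage there is no consensus to absorb errors, so the expected errors per read matter directly.

```python
# Expected errors per read at a uniform per-base quality. 250 bp assumed.
read_len = 250
for q, p_err in ((20, 1e-2), (35, 10 ** -3.5)):
    print(f"Q{q}: ~{read_len * p_err:.2f} expected errors per read")
# Q20: ~2.50 expected errors per read
# Q35: ~0.08 expected errors per read
```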

That makes complete sense…but it hurts to toss out sequences…Oh well!

Does trim.seqs with qtrim=T trim from both ends?

The trimming only affects the distal end of the sequence. Send all hate mail to 454/Roche :slight_smile:

I’m really interested in this matter, as I’m working on targeted metagenomics of fungi and quality trimming is one of the critical steps in my data treatment (like for a lot of people in metagenomics, I guess). So I’m curious about the manuscript you mentioned. Will it be in press soon?