quality scores with the alignment


I've just received my HiSeq dataset and have been playing around with quality filtering to reduce the amount of data (~700k sequences per sample is just too much!). For the first analysis I just discarded 90% of the data and ran my usual workflow, which works fine.

To potentially improve the output and use the entire data set, I wanted to test the effects of different quality filtering criteria.
I've found that quality filtering heavily impacts my community composition downstream.
For example, my most abundant phylotype drops from 50% relative abundance to around 10% as the quality criteria become more stringent.
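
To show how a shift like that can come purely from uneven read loss, here is a small Python sketch. The read counts and pass rates are made-up numbers chosen for illustration, not values from my data, and the function names are hypothetical. It only assumes that reads from one phylotype fail the quality filter more often than the rest.

```python
def relative_abundance(counts):
    """Return each group's share of the total read count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def apply_filter(counts, pass_rate):
    """Keep the given fraction of reads from each group (rounded to whole reads)."""
    return {name: round(n * pass_rate[name]) for name, n in counts.items()}


# Hypothetical sample: the dominant phylotype makes up half of the raw reads.
raw = {"phylotype_A": 500, "others": 500}

# Hypothetical pass rates: phylotype_A reads fail the filter far more often.
pass_rate = {"phylotype_A": 0.1, "others": 0.9}

filtered = apply_filter(raw, pass_rate)

before = relative_abundance(raw)["phylotype_A"]
after = relative_abundance(filtered)["phylotype_A"]
print(f"before filtering: {before:.0%}, after filtering: {after:.0%}")
```

With these assumed rates, the phylotype keeps 50 of its 500 reads while the others keep 450, so its share falls from one half to one tenth even though the underlying community has not changed.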

To avoid removing sequences based on quality and having to address this bias, I was wondering if it would be better to remove columns from the alignment based on the quality scores of the sequences. Is there a way to get quality scores associated with the alignment? And if not, is this something that could get implemented?



I think using HiSeq data is going to be very perilous (http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/). If you want to go that route, you should really have Mock community data so you know what your error rates are as you make decisions throughout your pipeline.


Hello Pat,

Thanks for the reply. The HiSeq was something the sequencing company chose to run. As I'm strapped for time, rerunning with MiSeq is not really an option.

My concern is that some amplicons will return with ambiguous or poor quality basecalls more easily than others, and thus removing sequences based on quality biases the observed composition. Instead I would rather align all the data and remove poor quality columns to reduce the total number of sequences that I need to cluster. What would be your concerns with this approach?
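
To make the idea concrete, here is a rough Python sketch of the column filtering I have in mind. It is plain Python, not a mothur command. The function names, the gap characters, and the mean-quality cutoff of 25 are all assumptions for illustration. It maps each read's per-base quality scores onto its aligned positions, averages the scores per column, and drops columns whose mean falls below the cutoff (all-gap columns are dropped too).

```python
GAP_CHARS = "-."  # assumed gap symbols in the alignment


def project_quality(aligned_seq, quals):
    """Map per-base quality scores onto alignment columns (None at gaps)."""
    n_bases = sum(1 for c in aligned_seq if c not in GAP_CHARS)
    if n_bases != len(quals):
        raise ValueError("quality scores do not match the number of bases")
    scores = iter(quals)
    return [None if c in GAP_CHARS else next(scores) for c in aligned_seq]


def column_mean_quality(aligned_seqs, qualities):
    """Mean quality per alignment column; None where every sequence has a gap."""
    n_cols = len(aligned_seqs[0])
    sums = [0] * n_cols
    counts = [0] * n_cols
    for seq, quals in zip(aligned_seqs, qualities):
        for col, q in enumerate(project_quality(seq, quals)):
            if q is not None:
                sums[col] += q
                counts[col] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def mask_low_quality_columns(aligned_seqs, qualities, min_mean_q=25):
    """Drop columns whose mean quality is below min_mean_q (hypothetical cutoff)."""
    means = column_mean_quality(aligned_seqs, qualities)
    keep = [i for i, m in enumerate(means) if m is not None and m >= min_mean_q]
    return ["".join(seq[i] for i in keep) for seq in aligned_seqs]


# Toy alignment: the fourth column is poor quality in every read.
aligned = ["AC-GT", "ACTGT", "A--GT"]
quals = [[30, 30, 10, 30], [30, 30, 30, 12, 30], [30, 8, 30]]

masked = mask_low_quality_columns(aligned, quals)
print(masked)
```

Every read is kept, but the low-quality column is removed from all of them, so reads that differ only at that position collapse to the same sequence.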


I think you’re trading biases. You’re probably going to be stuck with phylotyping the data and going from there.