As I understand it, the normal procedure for processing sequences always gets rid of ambiguous base calls (i.e. Ns) by screening with maxambig=0, which means any sequence containing an ambiguous base is discarded. But make.contigs can generate Ns - for example when the delta between two mismatched bases is below the threshold - and a sequence with a low number of Ns is not necessarily entirely bad (is it?). So why cull globally good sequences because of a few ambiguities? Why not just delete those Ns and risk getting only a genus-level classification for that particular sequence instead of a species-level one? (It's understandable if there are a lot of them, though.) I suppose we lose a huge amount of information by doing this.
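To make the discussion concrete, here is a toy sketch of what maxambig-style screening amounts to (an illustration only, not mothur's actual implementation; the function name `screen_seqs` is made up):

```python
# Toy sketch of maxambig screening: any sequence with more than
# max_ambig ambiguous bases (Ns) is discarded outright.
# This is NOT mothur's code, just an illustration of the behavior.
def screen_seqs(seqs, max_ambig=0):
    """Keep only sequences whose count of Ns is <= max_ambig."""
    return [s for s in seqs if s.upper().count("N") <= max_ambig]

contigs = ["ACGTACGT", "ACGTNCGT", "ANNTACGT"]
print(screen_seqs(contigs, max_ambig=0))  # ['ACGTACGT']
print(screen_seqs(contigs, max_ambig=1))  # ['ACGTACGT', 'ACGTNCGT']
```

The point of the question above is whether the single-N contig in the middle really deserves to be thrown away wholesale.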
Also, how does the aligner deal with ambiguities? Does it treat an N as A, T, C or G, or does it consider it a distinct nucleotide and find no match against the reference because the reference does not contain any Ns?
I also wonder whether mothur can recognize other ambiguity codes, like "W" for A or T. If I have such an ambiguity in my primer, must I provide an oligos file with both possibilities?
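For reference, ambiguity-aware matching under the standard IUPAC nucleotide codes can be sketched like this (how a particular aligner actually scores ambiguous bases depends on its own scheme; this just shows the code table the question refers to):

```python
# IUPAC nucleotide ambiguity codes: each code stands for a set of
# concrete bases. N matches anything; W matches A or T.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "N": {"A", "C", "G", "T"},
}

def bases_match(a, b):
    """True if two (possibly ambiguous) bases share at least one concrete base."""
    return bool(IUPAC[a.upper()] & IUPAC[b.upper()])

print(bases_match("N", "A"))  # True
print(bases_match("W", "T"))  # True
print(bases_match("W", "G"))  # False
```

Under this interpretation a W in a primer already covers both A and T, so a single oligos entry would suffice, but whether mothur handles it that way is exactly the question being asked.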
Thanks a lot
I think I’ve addressed this in one of your other posts. Our goal is to minimize error rates, not keep them around.
I understand the logic behind that, but how many complete sequences is it acceptable to remove this way? For instance, to process the V3-V4 region, I start with ~1.5 M sequences; after making the contigs and screening with optimization of the overlapping region, no ambigs allowed and some other parameters, I end up with 350,000 sequences (25%), and that number halves again after unique.seqs. It doesn't seem "normal", but I really don't know... Maybe the sequencing wasn't done right. That's why I'm stuck on ambiguities. And I guess it could be possible to keep a low error rate by deleting Ns only when there are few of them, rather than culling the entire sequence; that's something we could try, I don't know if you've ever considered it.
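On the unique.seqs step: the drop there is dereplication, not loss, since each unique sequence still carries the abundance of all its identical reads. A toy sketch (again not mothur's implementation, just the idea):

```python
# Toy dereplication: collapse identical reads and keep their counts,
# the way unique.seqs pairs a fasta with a names/count file.
from collections import Counter

def dereplicate(reads):
    """Map each unique sequence to its abundance."""
    return Counter(reads)

reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
uniques = dereplicate(reads)
print(len(uniques), "unique out of", len(reads))  # 3 unique out of 6
```

So halving after unique.seqs only means half your reads were duplicates of others, which is expected and not itself a quality problem.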
I thought I could do the alignment with a low number of Ns, but if the aligner can't tolerate Ns (i.e. if it doesn't know that N means A, T, C or G) and it takes an eternity, I obviously won't. However, if the align.report shows good scores and similarities... (otherwise I can always screen what's left). What do you think about all that? Sorry if I sound stubborn; I'm really just trying to understand and get the best out of my data.
Thank you Pat!
Alright, we're just using sequences that are too poor quality in the V3-V4 region, so everything is biased. I'll consider using only V4.
Sorry, this is not a reply but a second question... I was searching for a question similar to mine and found this one...
I just got 16S MiSeq sequences, and when I use maxambig=0 over 50% of my data is removed. If I use maxambig=2 instead, I keep about 80% of the sequences (primers 515f and 806r).
In a previous comment something was mentioned about using the Kozich method? Not sure what this is?
I am not sure what is best to do here?
It's the citation at the top of the MiSeq SOP: