As I understand it, the normal procedure for processing sequences always gets rid of ambiguous base calls (i.e. Ns) by screening with maxambig=0, which means any sequence containing an ambiguous base is discarded. But make.contigs can generate Ns - for example when the delta between two mismatched bases is below the threshold - and a sequence with a low number of Ns is not necessarily entirely bad (is it?). So why cull globally good sequences because of a few ambiguities? Why not just delete those Ns and risk getting only a genus-level classification for that particular sequence instead of a species-level one? (It's understandable if there are a lot of them, though.) I suppose we lose a huge amount of information by doing this.
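To make the discussion concrete, here is a toy sketch of what maxambig-style screening amounts to (an illustration only, not mothur's actual implementation; the function name `screen_seqs` is made up):

```python
# Toy sketch of maxambig screening: any sequence with more than
# max_ambig ambiguous bases (Ns) is discarded outright.
# This is NOT mothur's code, just an illustration of the behavior.
def screen_seqs(seqs, max_ambig=0):
    """Keep only sequences whose count of Ns is <= max_ambig."""
    return [s for s in seqs if s.upper().count("N") <= max_ambig]

contigs = ["ACGTACGT", "ACGTNCGT", "ANNTACGT"]
print(screen_seqs(contigs, max_ambig=0))  # ['ACGTACGT']
print(screen_seqs(contigs, max_ambig=1))  # ['ACGTACGT', 'ACGTNCGT']
```

The point of the question above is whether the single-N contig in the middle really deserves to be thrown away wholesale.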
Also, how does the aligner deal with ambiguities? Does it treat an N as A, T, C or G, or does it consider it a distinct nucleotide and find no match against the reference because the reference does not contain any Ns?
I also wonder whether mothur can recognize other ambiguity codes, like "W" for A or T. If I have such an ambiguity in my primer, must I provide an oligos file with both possibilities?
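For reference, ambiguity-aware matching under the standard IUPAC nucleotide codes can be sketched like this (how a particular aligner actually scores ambiguous bases depends on its own scheme; this just shows the code table the question refers to):

```python
# IUPAC nucleotide ambiguity codes: each code stands for a set of
# concrete bases. N matches anything; W matches A or T.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "N": {"A", "C", "G", "T"},
}

def bases_match(a, b):
    """True if two (possibly ambiguous) bases share at least one concrete base."""
    return bool(IUPAC[a.upper()] & IUPAC[b.upper()])

print(bases_match("N", "A"))  # True
print(bases_match("W", "T"))  # True
print(bases_match("W", "G"))  # False
```

Under this interpretation a W in a primer already covers both A and T, so a single oligos entry would suffice, but whether mothur handles it that way is exactly the question being asked.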
Thanks a lot
I think I’ve addressed this in one of your other posts. Our goal is to minimize error rates, not keep them around.
I understand the logic behind that, but how many complete sequences is it acceptable to remove this way? For instance, to process the V3-V4 region, I start with ~1.5 M sequences; after making the contigs and screening with optimization of the overlapping region, no ambigs allowed and some other parameters, I end up with 350,000 sequences (25%), and that number halves again after unique.seqs. It doesn't seem "normal", but I really don't know... Maybe the sequencing wasn't done right. That's why I'm stuck on ambiguities. And I guess it could be possible to keep a low error rate by deleting Ns only when there are few of them, rather than culling the entire sequence; that's something we could try, I don't know if you've ever considered it.
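On the unique.seqs step: the drop there is dereplication, not loss, since each unique sequence still carries the abundance of all its identical reads. A toy sketch (again not mothur's implementation, just the idea):

```python
# Toy dereplication: collapse identical reads and keep their counts,
# the way unique.seqs pairs a fasta with a names/count file.
from collections import Counter

def dereplicate(reads):
    """Map each unique sequence to its abundance."""
    return Counter(reads)

reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
uniques = dereplicate(reads)
print(len(uniques), "unique out of", len(reads))  # 3 unique out of 6
```

So halving after unique.seqs only means half your reads were duplicates of others, which is expected and not itself a quality problem.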
I thought I could do the alignment with a low number of Ns, but if the aligner can't tolerate Ns (i.e. if it doesn't know that N means A, T, C or G) and it takes an eternity, I obviously won't. However, if the align.report shows good scores and similarities... (otherwise I can always screen what's left). What do you think about all that? Sorry if I sound stubborn; I'm really just trying to understand and get the best out of my data.
Thank you Pat!
Alright, we're just using sequences that are too poor quality in the V3-V4 region, so everything is biased. I'll consider using only V4.
Sorry, this is not a reply but a second question... I was searching for a question similar to mine and found this one...
I just got 16S MiSeq sequences, and when I use maxambig=0 over 50% of my data is removed. If I use maxambig=2 instead, I keep about 80% of the sequences (primers 515f and 806r).
In a previous comment something was mentioned about using the Kozich method? Not sure what this is?
I am not sure what is best to do here?
It's the citation at the top of the MiSeq SOP: