Hi! I am running a batch of 16S sequences that have been trimmed to get rid of ambiguous bases (R2 only, first 20 bp removed). I am following the MiSeq SOP, and I don't usually get odd numbers when making contigs or aligning, but this time I'm getting lengths longer than 251 bp. Is this normal? Should I set maxlength to 251 to force them to trim there, or is the SOP's maxlength of 275 fine? The first photo is after unique.seqs, but the start and end numbers are the same for all the previous steps. The second image is after alignment to SILVA 138.2.
Maybe I’m worried over nothing but I am new to bioinformatics so I just wanted to make sure.
You need to ask your sequence provider what they did. I can imagine a few things… First, they could be generating only a partial run, so things are cutting off short. But that doesn't make sense, because you are seeing a 295 nt read (275 + the 20 you cut off). Second, they could be doing some level of modification of the data before sending it to you. Third, they could be generating longer reads while your amplicon is short, so the reads are going off the end of the amplicon. If this is the case, then you are very likely to have an artificially high error rate. I don't recall what region you are sequencing, but if it's the V4 region then this is a strong possibility.
There really is no reason to be removing the first 20 nt of the reads. Your sequence provider should be transparent about their methods and give you raw sequences with the barcodes, primers, adapters, etc that they used. This would be expected if you ever want to publish these data.
I usually see those superlong fragments in all my markers, but they are usually gone by the end of processing. Something I noticed is that they are not even in the same region: they fall outside the alignment of the others (look at the start/end; they start after the end of the good reads), so if you now run pcr.seqs over 1982-13393 they will be gone. You can try that and see how many you lose (there might be only 1 or 2!).
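To get a feel for how many reads sit outside the region before trimming anything, here is a minimal Python sketch. It is not what pcr.seqs does internally; it only checks whether each read's aligned span overlaps a window. The record names and the (name, start, end) layout are made up for illustration, and the coordinates are just borrowed from numbers mentioned in this thread.

```python
def outside_region(records, region_start, region_end):
    """Return names of reads whose aligned span does not overlap the window.

    records: iterable of (name, start, end) alignment coordinates (1-based).
    A read is "outside" if it ends before the window starts or starts after
    the window ends.
    """
    strays = []
    for name, start, end in records:
        if end < region_start or start > region_end:
            strays.append(name)
    return strays


# Hypothetical records; only the coordinate values echo this thread.
records = [
    ("good_1", 13862, 23444),
    ("good_2", 14285, 25283),
    ("stray_1", 25300, 26000),  # starts after the good reads end
]

strays = outside_region(records, region_start=13862, region_end=25283)
print(f"{len(strays)} of {len(records)} reads fall outside the region: {strays}")
```

If only a handful of reads show up as strays, losing them is probably not worth worrying about.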
When I run screen.seqs I lose many of them, sadly. Our sequence provider trimmed the reads and sent them back to us along with the primers. I've tried to get them to align and then screen without losing half of them, but it has not been working. There are always two very distinct starts and ends, so when I run screen.seqs I can't find a way to keep all my samples. My advisor suggested that I just don't trim the SILVA database and skip the screen.seqs step afterwards. The image below is after I aligned to silva.nr_v138_2.align. Everything looked great before the alignment, so I am unsure what is even going wrong.
My two cents: from the alignment, run head -n 100 youralignment > output.fasta in Linux or similar. That will create a FASTA file with the first 50 sequences. Open it in MEGA or any other alignment viewer and see what the difference is between starting at 13862 and 14285. You might have primers left, or some other artefact that you might be able to solve with pcr.seqs. But you will really need to look at WHAT is causing that difference.
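If you would rather tally the start/end positions than eyeball them, something like the Python sketch below could work. It assumes the aligned FASTA pads with "." and uses "-" for gaps, and it counts columns from 1. The function names and the tiny example alignment are invented for illustration; point it at the text of your own output.fasta instead.

```python
from collections import Counter

GAP_CHARS = {".", "-"}  # assumed padding and gap characters in the alignment


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def start_end(aligned):
    """Return 1-based columns of the first and last base, or None if all gaps."""
    cols = [i for i, ch in enumerate(aligned, start=1) if ch not in GAP_CHARS]
    if not cols:
        return None
    return cols[0], cols[-1]


def tally_coordinates(text):
    """Count how many sequences share each (start, end) pair."""
    return Counter(start_end(seq) for _, seq in parse_fasta(text))


# Toy alignment for illustration only.
example = """>seq1
..ACG-T...
>seq2
..ACGGT...
>seq3
....GGTA..
"""

counts = tally_coordinates(example)
for coords, n in counts.most_common():
    print(coords, n)
```

Two distinct (start, end) groups in the tally would tell you which reads to pull out and compare side by side in the viewer.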
The difference between alignment positions 13862 and 14285 is probably one or two bases. Same for 23444 and 25283. I wouldn't worry about it; it likely has to do with your sequence provider trimming bases.
The 502 nt long contig is two reads getting stitched together that don’t overlap.
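The arithmetic behind that can be sketched in a few lines of Python. The 251 nt read length comes from this thread; the 253 nt amplicon length is only an assumed, illustrative value for a short amplicon, and the function names are made up.

```python
def contig_length(r1_len, r2_len, overlap):
    """Length of a contig built from two reads that share `overlap` bases."""
    if overlap < 0 or overlap > min(r1_len, r2_len):
        raise ValueError("overlap must be between 0 and the shorter read length")
    return r1_len + r2_len - overlap


def overlap_for_amplicon(r1_len, r2_len, amplicon_len):
    """Bases the two reads must share to span an amplicon of the given length."""
    return max(0, r1_len + r2_len - amplicon_len)


READ_LEN = 251  # read length mentioned in this thread

# Two reads stitched end to end with no overlap at all.
no_overlap = contig_length(READ_LEN, READ_LEN, 0)

# Assumed amplicon length, for illustration only.
ASSUMED_AMPLICON = 253
expected_overlap = overlap_for_amplicon(READ_LEN, READ_LEN, ASSUMED_AMPLICON)
normal = contig_length(READ_LEN, READ_LEN, expected_overlap)

print(no_overlap, expected_overlap, normal)
```

A contig that comes out at exactly twice the read length is the signature of zero overlap.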
These two screenshots don't seem to go together, as they have very different total numbers of sequences.