18S V9 for eukaryotes: max length parameter?


I am currently processing data for my PI. It was decided to use 2x300 runs to create the amplicons…

With this in mind, I am wondering what the appropriate max length is… The SOP for 16S V4 shows that you input 275 bp for max length (while the amplicons should probably be around 250 bp). How is this parameter determined? What is the appropriate number to scale up to, and how does one make this decision?

I chose a max length of 350 because the 97.5th percentile showed a value of 343 bp. However, I feel like I picked this value somewhat arbitrarily, without any strong justification for ‘why’ I did this…

How long is the V9 region supposed to be for your organisms?

If we used 2x300 runs, doesn’t this mean that the amplicon length is ~300bp…?

No, that’s your sequencing (read) length. If you were to run your amplicon on a gel, how long would you expect it to be?
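
To make the read length vs. amplicon length distinction concrete, here is a rough Python sketch. The function name and the example numbers are purely illustrative assumptions, and it ignores primers and quality trimming; it simply assumes both reads are full length and start at opposite ends of the amplicon.

```python
def paired_end_overlap(read_length, amplicon_length):
    """Number of amplicon bases covered by BOTH reads of a pair.

    Assumes (hypothetically) that each read is full length and that the
    forward and reverse reads start at opposite ends of the amplicon.
    """
    # A read cannot cover more of the amplicon than the amplicon is long.
    covered_by_each = min(read_length, amplicon_length)
    # Bases covered twice = total bases read minus amplicon length (never negative).
    return max(0, 2 * covered_by_each - amplicon_length)


if __name__ == "__main__":
    # 2x300 run, 250 bp amplicon: each read spans the whole fragment.
    print(paired_end_overlap(300, 250))  # 250
    # 2x300 run, 350 bp amplicon: reads overlap in the middle.
    print(paired_end_overlap(300, 350))  # 250
    # 2x250 run, 253 bp amplicon.
    print(paired_end_overlap(250, 253))  # 247
```

So "2x300" only tells you how many bases each read has; the assembled contig should come out at roughly the amplicon length, whatever that is for your primers and organisms.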

ohhh…I see. I will have to check with my PI/sequencing facility then…I am not sure. Let’s say they were ~250bp; how would one go about picking the max length?

The SOP shows a max length of 275 bp…is an additional 25 bp a standard? Or is there more reasoning behind picking a max length based on the ‘summary.seqs’ output?

I would always go with your expected length plus a little buffer. Your summary.seqs output should mainly give you an idea of what your data look like and of their quality. If 50% of your sequences are double the size you expect, then there is most likely something wrong.

A snippet from the SOP that talks about it:
"This tells us that we have 152360 sequences that for the most part vary between 248 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this. Recall that the reads are supposed to be 251 bp each. This read clearly didn’t assemble well (or at all). "

Certainly add a bit of a buffer. Variable regions have variable lengths. For example, there are some thermophilic clostridia that have an extra 100 bp in V2.
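
The "expected length plus a buffer" advice, together with the sanity check against the length distribution, could be sketched in Python roughly like this. It is not mothur's own code: the function names are made up, the percentile lookup is a simple sorted-index approximation, and the example lengths are invented to resemble the SOP snippet above (mostly ~250 bp plus one 502 bp outlier).

```python
def length_summary(lengths):
    """Rough length distribution (min, percentiles, max) for a list of sequence lengths."""
    ordered = sorted(lengths)
    n = len(ordered)
    quantiles = {
        "min": 0.0, "2.5%": 0.025, "25%": 0.25, "median": 0.5,
        "75%": 0.75, "97.5%": 0.975, "max": 1.0,
    }
    # Simple approximation: take the value at the truncated sorted index.
    return {name: ordered[int(q * (n - 1))] for name, q in quantiles.items()}


def fraction_longer_than(lengths, max_length):
    """Fraction of sequences that a given max length cutoff would remove."""
    return sum(1 for length in lengths if length > max_length) / len(lengths)


if __name__ == "__main__":
    # Invented data: mostly ~250 bp contigs and one badly assembled 502 bp read.
    lengths = [250] * 95 + [253] * 4 + [502]

    expected_length = 250   # what you expect from the primers / a gel
    buffer = 25             # assumption: same margin as the 275 vs. 250 example in this thread
    max_length = expected_length + buffer

    print(length_summary(lengths))
    print(max_length, fraction_longer_than(lengths, max_length))
```

If the cutoff removes only a small tail of outliers, it is probably reasonable; if it would remove a large share of the data, the expected length (or the assembly) deserves a second look before the cutoff does.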

Everything people have been saying is spot on. One other comment…

I seem to recall that V9 isn’t all that long. If your amplicon is shorter than 300 nt, then you’re going to get even more errors. We ran into this when we were initially trying to get 2x300 to work with the V4 region; we had to dial back the v3 chemistry to do 2x250 (and it was still horrible).

Ok, yeah…I am having more questions arise the more I get into this…

Our DNA sequencing facility said that the amplicon length was 300-350 bp, depending on the organisms…which makes me wonder about the appropriate max length to use…

I am also wondering: since the variable regions do indeed vary between organisms, how is it that we align primers to a reference sequence and then trim the database using that reference sequence? Will this exclude some sequences that should actually be part of the downstream analysis? (I guess I am not fully comprehending how this process occurs.)

You insert gaps in the sequences so that each column of the alignment has what we hope are evolutionarily similar bases. Run head on the aligned SILVA file to see what this looks like, or open that alignment in an alignment viewer.
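
As a toy illustration of why trimming by alignment columns does not throw away organisms with longer or shorter variable regions, here is a small Python sketch. The two sequences, their names, and the column coordinates are all invented; a real reference alignment is far wider and also uses '.' for leading and trailing padding.

```python
def trim_alignment(aligned, start, end):
    """Keep alignment columns start..end (0-based, end exclusive) for every sequence."""
    return {name: seq[start:end] for name, seq in aligned.items()}


def degap(seq):
    """Remove gap characters to recover the unaligned bases."""
    return seq.replace("-", "").replace(".", "")


if __name__ == "__main__":
    # Invented alignment: both rows have the same number of columns,
    # but one organism has far fewer bases in the variable stretch.
    aligned = {
        "short_region": "ACGT--GG----TTAC",
        "long_region":  "ACGTAAGGCCCCTTAC",
    }

    # Hypothetical column coordinates where the primers land in the alignment.
    trimmed = trim_alignment(aligned, 4, 12)

    for name, seq in trimmed.items():
        print(name, seq, len(seq), len(degap(seq)))
```

Both trimmed rows span the same columns, yet they hold different numbers of real bases once the gaps are removed, so the trim is defined by position in the alignment rather than by sequence length.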

