"finesse" in screen.seqs

Hey guys,
I’ve been wondering about the start/end positions you get after aligning sequences and then the follow up of screen.seqs based on some criteria. I’m using the V1V2 regions in my dataset, sequencing from the 5’ (27F) end… Ideally i would want my sequences to incorporate as much as the variable regions as possible. Now the V1V2 region span (roughly) positions 50 to 300 on the 16S gene http://www.bioinformatics-toolkit.org/Help/Topics/hypervariableRegions.html … But i find the start/end positions given by mothur to be confusing in the sense of position on the 16S gene. For example, with the Costello dataset, after screening and filtering, the start-end are 1-374… this is where I get thrown off… is this just the length of which your sequences are being compared to each other with??

What i did then was align this sequences to the positions on the e.coli 16S and found that they started ~160 and ended ~330 (which was the sequencing primer end, and makes be think this was a correct method)… So now i got to see where most of my sequence analysis on the 16S gene was taking place… which in the end excluded lots of the V1 region (assuming it’s around positions 50-150)

My question then;

Is this what you think about when doing your anlaysing? it might be great to get mothur to work out the length/start:end to keep 80% of your sequences, but maybe 65% of them will give you a longer length, but not just any longer length, length that goes into important parts of the gene. OR maybe you screen out sequences that end before some point that maybe doesn’t have as much information in it anyway

Is this a correct way to think about this?

Shaun

For example, with the Costello dataset, after screening and filtering, the start-end are 1-374… this is where I get thrown off… is this just the length of which your sequences are being compared to each other with??

This is the alignment length. If you use the 8f/27f primer then your sequence will start at E. coli position 28. Tack on another 250 bases of sequence and you’ll be around 275 bp. Throw in the gaps to make the alignment and it’s conceivable that you could get to 370 as an alignment length. Keep in mind that the V1 region is known to have a lot of insertions/introns (e.g. TM7).

What i did then was align this sequences to the positions on the e.coli 16S and found that they started ~160 and ended ~330 (which was the sequencing primer end, and makes be think this was a correct method)… So now i got to see where most of my sequence analysis on the 16S gene was taking place… which in the end excluded lots of the V1 region (assuming it’s around positions 50-150)

I’m not sure I follow, but if you’re trying to align to an unaligned Ecoli sequence that isn’t likely to go too well. A better approach would be to align to silva.bacteria.fasta, fish out the Ecoli reference sequence (e.g. Z83205.1) and then compare teh numbering in that sequence to what you’ve got in your newly aligned sequences. Keep in mind that we remove the 27f primer sequence so you’ll have to tack on 27 to your counts.

Is this what you think about when doing your anlaysing? it might be great to get mothur to work out the length/start:end to keep 80% of your sequences, but maybe 65% of them will give you a longer length, but not just any longer length, length that goes into important parts of the gene. OR maybe you screen out sequences that end before some point that maybe doesn’t have as much information in it anyway

I generally try to maximize the number of sequences and get them to be as long as possible. I don’t worry too much about the specific region. I may want them to all start at the same position (e.g. start=1044) and require them to end after a specific position or be a minimum length. You can also do what you suggest using the optimize option (Redirecting…). These different strategies are laid out in the Costello example analysis on the wiki.

Hope this helps,
Pat