use screen.seqs on sequences with few overlaps

Hi,

I aligned my 16S sequences with silva database and it looked like this:
mothur > summary.seqs(fasta=16S_001_R1_trim.unique.align)


Start End NBases Ambigs Polymer NumSeqs Minimum: 0 0 0 0 1 1 2.5%-tile: 1044 1046 1 0 1 419 25%-tile: 1044 1143 13 0 2 4187 Median: 10277 16298 46 0 3 8373 75%-tile: 42590 43116 134 0 4 12559 97.5%-tile: 43115 43116 151 0 6 16327 Maximum: 43116 43117 151 0 8 16745 Mean: 17707.3 20214.5 70.444 0 3.17766 # of Seqs: 16745

Which to me there are very few overlaps among the sequences, so I tried the following command with the screen.seqs:
screen.seqs(fasta=16S_001_R1_trim.unique.align, name=16S-001_S1_L001_R1_001.trim.names, optimize=start-end, criteria=60)

and got this

mothur > summary.seqs(fasta=16S_001_R1_trim.unique.good.align)


Start End NBases Ambigs Polymer NumSeqs Minimum: 7699 14946 100 0 2 1 2.5%-tile: 7926 14956 102 0 3 86 25%-tile: 9898 15690 126 0 3 852 Median: 10275 16298 147 0 4 1703 75%-tile: 13855 21304 151 0 5 2554 97.5%-tile: 14985 21935 151 0 8 3319 Maximum: 14987 21994 151 0 8 3404 Mean: 11192.8 17843.5 138.123 0 4.17215 # of Seqs: 3404

Which looks to me, the 3404 sequences are still not over lapping. Do you have any recommendations on what parameters I should use for screen.seqs? or any other commands I should try?

Thanks,
Alvin

Did you ever flip your sequences? Judging from your file name (R1), I suspect these are reverse reads. If so, you likely need to do the reverse compliment of them adn then align them or at least run align.seqs with the flip=T option. Your tables don’t look like they’ve aligned right.

Hi Pat,

Thanks for the reply. The samples were prepared such that some sequences are flipped and some are not…confusing? I know, it took me awhile to figure it out too, as I didn’t prepare the samples. The alignment above was ran with flip=t

I quickly tried reverse complementing all the sequences and aligned them again with option flip=t:


mothur > summary.seqs(fasta=16S_001_R1_trim.unique.rc.align)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1096 12 0 2 419
25%-tile: 8198 14949 104 0 3 4187
Median: 13862 21336 132 0 4 8373
75%-tile: 16416 23439 150 0 4 12559
97.5%-tile: 43053 43116 151 0 6 16327
Maximum: 43116 43117 151 0 8 16745
Mean: 16077.4 20674 115.681 0 3.91717

of Seqs: 16745

Do you have any suggestions on what should I try next? or should I spend the effort to re prepare the sample?

Thanks,
Alvin

The thing that doesn’t make sense is that there’s no clear start/stop point for any large chunk of the sequences. What region were they sequencing?

it should be the v3-v4 region ~500bp, but the illumina sequences have a maximum length of 151bp. The technology used was Nextera XT, where the amplicons were cut and amplified randomly. hmm…I guess there’s the problem. In screen.seqs, i used the flag “optimize=start-end”, attempted to find the region with the most overlap. The result didn’t look very good as we saw.

Huh, yeah, that’s an interesting way to do it. Why? Most people would instead amplify a specific window within the gene and sequence that. I’m afraid what you have will be a mess to analyze. About all I can suggest is to run the classifier on everythign and see what you’ve got. Unfortunately, even that’s frought with problems since the different regions of the gene evolve at different rates.

Pat