Customizing reference alignment for the V3-V4 region

venki09 · September 20, 2024, 6:47am

Hi!

I am very new Mothur and want to customize my reference alignment to V3-V4 region. I read the blog post and I don’t know how to trim E.coli 16S rRNA gene sequence in the blog post to start and end with my primers. My primers are in 5’-3’ direction:

341F → CCTAYGGGRBGCASCAG
806R → GGACTACNNGGGTATCTAAT

Thanks in advance for helping me out!

Edit:

I tried using a different E.coli 16S rRNA gene sequence (GenBank ID: MF953263.1) and matched it with my primers and removed the nucleotides before the forward primer and after the reverse complement of the reverse primer and finally removed the primers too all using a text editor.

Then I followed the procedure in the blog post and found out the start and end positions (start=6428, end=23444).

mothur > summary.seqs(fasta=ecoli_v3v4.align)

Using 32 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        6428    23444   426     0       5       1
2.5%-tile:      6428    23444   426     0       5       1
25%-tile:       6428    23444   426     0       5       1
Median:         6428    23444   426     0       5       1
75%-tile:       6428    23444   426     0       5       1
97.5%-tile:     6428    23444   426     0       5       1
Maximum:        6428    23444   426     0       5       1
Mean:   6428    23444   426     0       5
# of Seqs:      1

Then after running pcr.seqs with the start and end positions and also running summary.seqs on the output I got this:

mothur > summary.seqs(fasta=silva.nr_v138_1.pcr.align)

Using 32 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       15343   221     0       3       1
2.5%-tile:      1       17017   401     0       4       3666
25%-tile:       1       17017   406     0       4       36651
Median:         1       17017   425     0       5       73301
75%-tile:       1       17017   429     0       6       109951
97.5%-tile:     1       17017   571     1       7       142936
Maximum:        3826    17017   1631    5       16      146601
Mean:   1       17016   433     0       5
# of Seqs:      146601

It took 2 secs to summarize 146601 sequences.

Output File Names:
silva.nr_v138_1.pcr.summary

Then I used screen.seqs and set the maxlength to 571 and minlength to 400. Running summary.seqs on that output got me this:

mothur > summary.seqs(fasta=silva.nr_v138_1.pcr.good.align)

Using 32 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       15343   400     0       3       1
2.5%-tile:      1       17017   404     0       4       3373
25%-tile:       1       17017   406     0       4       33722
Median:         1       17017   425     0       5       67444
75%-tile:       1       17017   429     0       5       101166
97.5%-tile:     1       17017   569     0       7       131515
Maximum:        3826    17017   571     0       8       134887
Mean:   1       17016   430     0       4
# of Seqs:      134887

It took 2 secs to summarize 134887 sequences.

Output File Names:
silva.nr_v138_1.pcr.good.summary

After all that, I followed the MiSeq_SOP for my data. This is what I got after running summary.seqs on the stability.trim.contigs.fasta

mothur > summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)

Using 32 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       20      20      0       1       1
2.5%-tile:      1       124     124     0       3       10213
25%-tile:       1       205     205     0       4       102129
Median:         1       299     299     0       5       204258
75%-tile:       1       465     465     0       6       306386
97.5%-tile:     1       465     465     1       6       398302
Maximum:        1       602     602     112     301     408514
Mean:   1       322     322     0       5
# of unique seqs:       408514
total # of seqs:        408514

It took 8 secs to summarize 408514 sequences.

Output File Names:
stability.trim.contigs.summary

Next, I ran screen.seqs with minlength as 232, maxlength as 285 (because this is the range of read lengths of my trimmed raw data, trimmed using trimgalore --paired)

Then I used unique.seqs and got this after running summary.seqs on my unique.seqs output

mothur > summary.seqs(count=stability.trim.contigs.good.count_table)
Using stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 32 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       232     232     0       3       1
2.5%-tile:      1       233     233     0       3       652
25%-tile:       1       243     243     0       4       6514
Median:         1       252     252     0       5       13028
75%-tile:       1       264     264     0       6       19541
97.5%-tile:     1       284     284     0       6       25403
Maximum:        1       285     285     0       8       26054
Mean:   1       254     254     0       5
# of unique seqs:       16624
total # of seqs:        26054

It took 1 secs to summarize 26054 sequences.

Output File Names:
stability.trim.contigs.good.unique.summary

I face this problem after I run align.seqs with the output of unique.seqs and my custom reference alignment. The summary.seqs output of the aligned sequences:

mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

Using 32 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      1       9       4       0       1       652
25%-tile:       1       9542    8       0       3       6514
Median:         16124   17017   18      0       3       13028
75%-tile:       16156   17017   33      0       4       19541
97.5%-tile:     16161   17017   259     0       6       25403
Maximum:        17017   17017   282     0       8       26054
Mean:   10917   13635   64      0       3
# of unique seqs:       16624
total # of seqs:        26054

It took 1 secs to summarize 26054 sequences.

Output File Names:
stability.trim.contigs.good.unique.summary

This does not look right to me and I don’t know where I went wrong, hence my lengthy step by step breakdown on what I had done to get to this point.

Please help me understand where I went wrong and how to rectify it.

If you reached the end of this post, I applaud your patience and would be very grateful if you helped me with this.

Thanks again!

pschloss · September 24, 2024, 12:58pm

Good work so far!

Instead of doing this…

Then I used screen.seqs and set the maxlength to 571 and minlength to 400. Running summary.seqs on that output got me this:

Could you run screen.seqs where the only arguments are start=1, end=17017 and then repeat the following steps you listed. It looks like the parameters you’re including are including a lot of odd ball sequences through that should likely be removed.

Pat

system · October 4, 2024, 12:58pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
v3 and v4 region on Silvia and GreenGenes database. Commands in mothur	1	2792	September 23, 2014
Customize Silva reference for V4 region Commands in mothur	6	552	August 13, 2023
Execute pcr.seqs Commands in mothur	3	901	August 27, 2021
Trim silva.bacteria.fasta to V3-V4 region Theory behind mothur	2	4242	December 28, 2014
18S Customize your reference alignment Theory behind mothur	3	1636	September 21, 2019

Customizing reference alignment for the V3-V4 region

Related topics