Hi!
I am very new Mothur and want to customize my reference alignment to V3-V4 region. I read the blog post and I don’t know how to trim E.coli 16S rRNA gene sequence in the blog post to start and end with my primers. My primers are in 5’-3’ direction:
341F → CCTAYGGGRBGCASCAG
806R → GGACTACNNGGGTATCTAAT
Thanks in advance for helping me out!
Edit:
I tried using a different E.coli 16S rRNA gene sequence (GenBank ID: MF953263.1) and matched it with my primers and removed the nucleotides before the forward primer and after the reverse complement of the reverse primer and finally removed the primers too all using a text editor.
Then I followed the procedure in the blog post and found out the start and end positions (start=6428, end=23444).
mothur > summary.seqs(fasta=ecoli_v3v4.align)
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 6428 23444 426 0 5 1
2.5%-tile: 6428 23444 426 0 5 1
25%-tile: 6428 23444 426 0 5 1
Median: 6428 23444 426 0 5 1
75%-tile: 6428 23444 426 0 5 1
97.5%-tile: 6428 23444 426 0 5 1
Maximum: 6428 23444 426 0 5 1
Mean: 6428 23444 426 0 5
# of Seqs: 1
Then after running pcr.seqs with the start and end positions and also running summary.seqs on the output I got this:
mothur > summary.seqs(fasta=silva.nr_v138_1.pcr.align)
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 15343 221 0 3 1
2.5%-tile: 1 17017 401 0 4 3666
25%-tile: 1 17017 406 0 4 36651
Median: 1 17017 425 0 5 73301
75%-tile: 1 17017 429 0 6 109951
97.5%-tile: 1 17017 571 1 7 142936
Maximum: 3826 17017 1631 5 16 146601
Mean: 1 17016 433 0 5
# of Seqs: 146601
It took 2 secs to summarize 146601 sequences.
Output File Names:
silva.nr_v138_1.pcr.summary
Then I used screen.seqs and set the maxlength to 571 and minlength to 400. Running summary.seqs on that output got me this:
mothur > summary.seqs(fasta=silva.nr_v138_1.pcr.good.align)
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 15343 400 0 3 1
2.5%-tile: 1 17017 404 0 4 3373
25%-tile: 1 17017 406 0 4 33722
Median: 1 17017 425 0 5 67444
75%-tile: 1 17017 429 0 5 101166
97.5%-tile: 1 17017 569 0 7 131515
Maximum: 3826 17017 571 0 8 134887
Mean: 1 17016 430 0 4
# of Seqs: 134887
It took 2 secs to summarize 134887 sequences.
Output File Names:
silva.nr_v138_1.pcr.good.summary
After all that, I followed the MiSeq_SOP for my data. This is what I got after running summary.seqs on the stability.trim.contigs.fasta
mothur > summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 20 20 0 1 1
2.5%-tile: 1 124 124 0 3 10213
25%-tile: 1 205 205 0 4 102129
Median: 1 299 299 0 5 204258
75%-tile: 1 465 465 0 6 306386
97.5%-tile: 1 465 465 1 6 398302
Maximum: 1 602 602 112 301 408514
Mean: 1 322 322 0 5
# of unique seqs: 408514
total # of seqs: 408514
It took 8 secs to summarize 408514 sequences.
Output File Names:
stability.trim.contigs.summary
Next, I ran screen.seqs with minlength as 232, maxlength as 285 (because this is the range of read lengths of my trimmed raw data, trimmed using trimgalore --paired)
Then I used unique.seqs and got this after running summary.seqs on my unique.seqs output
mothur > summary.seqs(count=stability.trim.contigs.good.count_table)
Using stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 232 232 0 3 1
2.5%-tile: 1 233 233 0 3 652
25%-tile: 1 243 243 0 4 6514
Median: 1 252 252 0 5 13028
75%-tile: 1 264 264 0 6 19541
97.5%-tile: 1 284 284 0 6 25403
Maximum: 1 285 285 0 8 26054
Mean: 1 254 254 0 5
# of unique seqs: 16624
total # of seqs: 26054
It took 1 secs to summarize 26054 sequences.
Output File Names:
stability.trim.contigs.good.unique.summary
I face this problem after I run align.seqs with the output of unique.seqs and my custom reference alignment. The summary.seqs output of the aligned sequences:
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1 9 4 0 1 652
25%-tile: 1 9542 8 0 3 6514
Median: 16124 17017 18 0 3 13028
75%-tile: 16156 17017 33 0 4 19541
97.5%-tile: 16161 17017 259 0 6 25403
Maximum: 17017 17017 282 0 8 26054
Mean: 10917 13635 64 0 3
# of unique seqs: 16624
total # of seqs: 26054
It took 1 secs to summarize 26054 sequences.
Output File Names:
stability.trim.contigs.good.unique.summary
This does not look right to me and I don’t know where I went wrong, hence my lengthy step by step breakdown on what I had done to get to this point.
Please help me understand where I went wrong and how to rectify it.
If you reached the end of this post, I applaud your patience and would be very grateful if you helped me with this.
Thanks again!