I very confound that how should I know that the start position and the end position when I need to align my sequences to the reference alignment.
Hello Huang Yanfang!
pcr.seqs cuts down the length of the reference file to match your expected region. However it is not a compulsory step, you can simply align to the full length reference file.
pcr.seqs may be appropriate if:
- you are building a pipeline for multiple datasets covering the same amplicon region,
and/or - you are running out of disk space when you try to run a full length alignment.
First try aligning against the full length database (eg. “silva.bacteria.fasta” or “silva.seed_v132.align”) to see if it works. If you choose to make a custom alignment region, you could estimate the start and end locations based on alignment of your amplicon primers, or (if your full dataset is prohibitively large) by aligning a subsample of your data against the full length database.
You could do the latter as follows:
First extract the first 1000 reads from each fastq file.
Run these fastqs through the first few steps of the Mothur SOP.
Then:
align.seqs(fasta=first1000.unique.fasta, reference=silva.seed_v132.align)
summary.seqs(fasta=current, count=current)
You can then deduce the alignment region from the output. For example:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 13858 21274 117 0 3 1
2.5%-tile: 13862 21284 125 0 3 198901
25%-tile: 13862 21287 125 0 3 1989008
Median: 13862 21287 125 0 4 3978015
75%-tile: 13862 21287 125 0 4 5967022
97.5%-tile: 13862 21293 125 0 5 7757129
Maximum: 13862 21296 125 0 8 7956029
Mean: 13861 21286 124 0 3
For this dataset you’d see that everything aligns between 13,000 and 22,000 relative to the reference file (which has 50,000 columns). You could make your custom reference file using these values:
pcr.seqs(fasta=silva.seed_v132.align, start=13000, end=22000, keepdots=F, processors=8)
I hope that is helpful. Good luck with your analysis!
Thanks @SJSalter - in case others are looking for where @1106351165 got the screenshot from, there are instructions on customizing the region for use with pcr.seqs
available here…
This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.