Hi,
I am currently running through the MiSeqSOP. I have made a custom 16S v4 reference file from the silva.bacteria.fasta available from the mothur website then aligned my sequences:
_#Download a complete 16S bacterial sequence from NCBI as fasta format and add the primers used for amplification of the 16S gene for the NGS data generation to the 16S sequence fasta file
16S_serratia
#Align the primers on the 16S sequence using Seaview with the Clustal algorithm, then cut the region of 16S sequence targeted by your primers, and save it as a fasta file
e.g. in this case the v4 region: 16Sv4_serratia
#In Mothur
#Align the targeted 16S region sequence to the Mothur reference file (silva.bacteria.fasta)_
align.seqs(fasta=/home/chris/Desktop/16Sv4_serratia, reference=/media/chris/Storage/Genomes_Transcriptomes_Proteomes/16S_microbial_sequences/Silva/Mothur/silva.bacteria/silva.bacteria.fasta, processors=8)
#Summarise the alignment
summary.seqs(fasta=/home/chris/Desktop/16Sv4_serratiaalign, processors=8)
[/i]
Start End NBases Ambigs Polymer NumSeqs
Minimum: 219 0 4 1
2.5%-tile: 6119 13861 219 0 4 1
25%-tile: 6119 13861 219 0 4 1
Median: 6119 13861 219 0 4 1
75%-tile: 6119 13861 219 0 4 1
97.5%-tile: 6119 13861 219 0 4 1
Maximum: 6119 13861 219 0 4 1
Mean: 6119 13861 219 0 4
of Seqs: 1[/size]
Based on the information from the summary.seqs (above) I get the start and end positions where most of the reads will align to the reference, and can therefore use pcr.seqs as in the MiSeqSOP to make a region targeted reference alignment which will allow us to have smaller .align files.
pcr.seqs(fasta=/media/chris/Storage/Genomes_Transcriptomes_Proteomes/16S_microbial_sequences/Silva/Mothur/silva.bacteria/silva.bacteria.fasta, start=6119, end=13861, processors=8, keepdots=F)
#Rename the file SILVAv4.fasta[/i]
I then align my data to the custom alignment:
align.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/5_unique.seqs/merged16S.good.unique.fasta, reference=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_pcr.seqs/SILVAv4.fasta, processors=8)
[/i]
After the alignment I have 875936 sequences (see below) though some donĂ´t align exactly at the same position:
summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/5_unique.seqs/merged16S.good.unique.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/6_count.seqs/merged16S.good.count_table, processors=8)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 21899
25%-tile: 7028 7742 8 0 2 218985
Median: 7028 7742 8 0 2 437969
75%-tile: 7028 7742 8 0 2 656953
97.5%-tile: 7736 7742 54 0 3 854038
Maximum: 7742 7742 119 0 18 875936
Mean: 6250.26 7088 10.6911 0 2.0833
of unique seqs: 392343
total # of seqs:[/i]
Following the SOP I then use screen.seqs to remove those badly aligning sequences, but I find myself with only 299016 sequences afterâŚ(see steps below):
screen.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.unique.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/6_count.seqs/merged16S.good.count_table, summary=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.unique.summary, start=7028, end=7742, maxhomop=8, processors=8)
#Check that all required undesired sequences were removed:
summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/9_screen.seqs/merged16S.good.unique.good.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/9_screen.seqs/merged16S.good.good.count_table, processors=8)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 2067 7742 8 0 1 1
2.5%-tile: 4118 7742 8 0 2 7476
25%-tile: 7028 7742 8 0 2 74755
Median: 7028 7742 8 0 2 149509
75%-tile: 7028 7742 8 0 2 224263
97.5%-tile: 7028 7742 55 0 3 291541
Maximum: 7028 7742 113 0 8 299016
Mean: 6692.58 7742 13.0448 0 2.05437
of Seqs: [/i]
I do not understand how/why screen.seqs �??
Having checked on the mothur forum, I saw that someone else was having a somewhat similar problem (http://mothur.ltcmp.net/t/screen-seqs-removed-47-of-my-sequences/1429/1), and that it was suggested that for this iteration of the screen.seqs command, he should swap the positions specified for the âstartâ and âendâ arguments. If I try to do so I do indeed have much less sequences removed but I am not sure that I am doing the right thing.
summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.unique.good1.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.good.count_table1, processors=8)
_Start End NBases Ambigs Polymer NumSeqs
Minimum: 2067 7740 1 0 1 1
2.5%-tile: 4118 7742 4 0 2 19916
25%-tile: 7028 7742 8 0 2 199157
Median: 7028 7742 8 0 2 398313
75%-tile: 7028 7742 8 0 2 597469
97.5%-tile: 7736 7742 54 0 2 776710
Maximum: 7742 7742 113 0 8 796625
Mean: 6846.2 7742 10.9162 0 2.01156
of unique seqs: 345671
total # of seqs: _
Can anyone help me get my head around this and so that I can understand how to correctly use the screen.seqs command, and help me understandwhat I am doing wrong?
Thank you,
Chris