I am following the 454 SOP almost perfectly, and I have three data sets for the same sample: bacteria, fungi, and algae. Everything is working fine for fungi and algae.
However during the make.shared step with the bacterial sequences I use:
make.shared(list=final.an.list, group=final.groups, label=0.03)
And it outputs:
Output File Names:
I can't get it to stick with 0.03.
I've reprocessed this data set 3 times to make sure it's not something weird that I'm doing, and it isn't.
I should also mention that earlier in this data set, while running filter.seqs, I have to run it as
Otherwise I lose all of my columns.
Are the two problems related? Anything I can change to keep it at a 0.03 level?
“Almost perfectly” - what does that mean? What are you changing?
I think a few things are happening. First, the filter.seqs command is doing very little to help if you remove the trump=. option. When it wipes out all of your columns, that is because your settings in screen.seqs were incorrect. If you post the output from running summary.seqs before screen.seqs and the screen.seqs command I can help you find the right parameters. I think the inability to get to 0.03 is related to you having sequences that don’t fully overlap with each other because of the screen.seqs/filter.seqs problem. It might also be related to the cutoff you used. Because of quirks in the algorithm, it is important to use a cutoff in dist.seqs of 0.15 or so.
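Concretely, the chain I have in mind looks something like this. The file names and the start/end coordinates below are only placeholders (the coordinates must come from your own summary.seqs output so that they bracket the overlapping region of your reads):

```
screen.seqs(fasta=final.align, group=final.groups, start=1046, end=13127, maxhomop=8)
filter.seqs(fasta=final.good.align, vertical=T, trump=.)
dist.seqs(fasta=final.good.filter.fasta, cutoff=0.15)
cluster(column=final.good.filter.dist, name=final.good.names)
make.shared(list=final.an.list, group=final.groups, label=0.03)
```

The key points are that trump=. in filter.seqs removes any column containing a "." in any sequence (which is what forces your reads onto the same overlapping region), and that the 0.15 cutoff in dist.seqs leaves the clustering algorithm enough headroom to report an 0.03 line.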
"Almost perfectly" means that the sequencing centre sequenced my DNA from both the F and the R primers (I know, don't say it).
So to deal with this I used sff.multiple(file=sfffiles.txt, order=B, minflows=250, maxflows=720, pdiffs=5, bdiffs=2, maxhomop=8, minlength=200, flip=F, processors=7) for the forward primer.
Then sff.multiple(file=sfffiles.txt, order=B, minflows=250, maxflows=720, pdiffs=5, bdiffs=2, maxhomop=8, minlength=200, flip=T, processors=7) for the reverse primer, and then I used merge.files to combine the two data sets. I know this is likely where most of my troubles originate. Interestingly, this approach seems to have worked fine for the ITS and the 23LSU, probably because the sequences are shorter from the F to the R primer and they overlap more nicely.
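For clarity, the merge step was just mothur's merge.files, along these lines (the file names here are placeholders, not my actual ones):

```
merge.files(input=forward.trim.fasta-reverse.trim.fasta, output=combined.trim.fasta)
merge.files(input=forward.groups-reverse.groups, output=combined.groups)
```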
Here is the summary.seqs before the screen.seqs
Using 7 processors.
                Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:        0        0        0        0       1        1
2.5%-tile:      1044     6451     321      0       4        553
25%-tile:       1046     7943     343      0       4        5530
Median:         1773     13127    354      0       5        11059
75%-tile:       2072     13127    367      0       5        16588
97.5%-tile:     3157     13133    388      0       7        21565
Maximum:        5260     14987    476      0       8        22117
Mean:           1639.81  10612.4  354.991  0       4.83601
# of unique seqs: 16718
total # of seqs: 22117
You see what I mean.
I did use 0.15 in dist.seqs.
I should also mention that this is only a problem when I process as a batch using sff.multiple. If I process the files individually, this is not an issue.