make.shared will only produce data at the 0.01 (~99% similarity) level

I am following the 454 SOP almost perfectly, and I have three data sets for the same samples: bacteria, fungi, and algae. Everything is working fine for the fungi and the algae.

However during the make.shared step with the bacterial sequences I use:

make.shared(list=final.an.list, group=final.groups, label=0.03)

And it outputs:

0.01
Output File Names:
final.an.shared
final.an.E1.rabund
final.an.E2.rabund
final.an.E4.rabund
final.an.E5.rabund
final.an.FE.rabund
final.an.GR.rabund
final.an.M.rabund

I can’t get it to stick at 0.03.

I’ve reprocessed this data set 3 times to make sure it’s not something weird that I’m doing, and it isn’t.

I should also mention that earlier in this data set, I have to run filter.seqs as

filter.seqs(fasta=sfffiles.unique.good.align, processors=7)

Otherwise I lose all of my columns.

Are the two problems related? Anything I can change to keep it at a 0.03 level?

“Almost perfectly” - what does that mean? What are you changing?

I think a few things are happening. First, the filter.seqs command does very little to help if you remove the trump=. option. When it wipes out all of your columns, that is because your settings in screen.seqs were incorrect. If you post the output from running summary.seqs before screen.seqs, along with the screen.seqs command you used, I can help you find the right parameters.

I think the inability to get to 0.03 is related to you having sequences that don’t fully overlap with each other because of the screen.seqs/filter.seqs problem. It might also be related to the cutoff you used: dist.seqs discards distances above the cutoff, and when too many distances are missing, cluster has to lower the cutoff and can only report OTUs at smaller labels (which is why you’re seeing 0.01). Because of quirks in the algorithm, it is important to use a cutoff in dist.seqs of 0.15 or so.
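For reference, the relevant SOP steps look roughly like this (just a sketch; the name/group file names are placeholders, and the screen.seqs optimization values should come from your own summary.seqs output):

screen.seqs(fasta=sfffiles.unique.align, name=sfffiles.names, group=sfffiles.groups, optimize=start-end, criteria=95, processors=7)

filter.seqs(fasta=sfffiles.unique.good.align, vertical=T, trump=., processors=7)

dist.seqs(fasta=sfffiles.unique.good.filter.fasta, cutoff=0.15, processors=7)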

Pat

“Almost perfectly” means that the sequencing centre sequenced my DNA from both the F and the R primers (I know, don’t say it).

So to deal with this I used sff.multiple(file=sfffiles.txt, order=B, minflows=250, maxflows=720, pdiffs=5, bdiffs=2, maxhomop=8, minlength=200, flip=F, processors=7) for the forward primer.

Then sff.multiple(file=sfffiles.txt, order=B, minflows=250, maxflows=720, pdiffs=5, bdiffs=2, maxhomop=8, minlength=200, flip=T, processors=7) for the reverse primer, and then I used merge.files to combine the two data sets. I know this is likely where most of my troubles originate. Interestingly, this approach seems to have worked fine for the ITS and the 23LSU, probably because those sequences are shorter from the F to the R primer and they overlap more nicely.
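For completeness, the merge step was along these lines (file names here are illustrative, not my exact ones):

merge.files(input=forward.trim.fasta-reverse.trim.fasta, output=combined.trim.fasta)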

Here is the summary.seqs output from before screen.seqs:

Using 7 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	0	0	0	0	1	1
2.5%-tile:	1044	6451	321	0	4	553
25%-tile:	1046	7943	343	0	4	5530
Median:		1773	13127	354	0	5	11059
75%-tile:	2072	13127	367	0	5	16588
97.5%-tile:	3157	13133	388	0	7	21565
Maximum:	5260	14987	476	0	8	22117
Mean:		1639.81	10612.4	354.991	0	4.83601

# of unique seqs: 16718
total # of seqs: 22117

You see what I mean.

I did use 0.15 in dist.seqs.

Any ideas?

I should also mention that this is only a problem when I process the samples as a batch using sff.multiple. If I do them individually, this is not an issue.

Two things…

  1. I agree that the problems are because you have a longer fragment and the reads don’t overlap. Because of that, I’d analyze the forward and reverse reads separately (sorry!).

  2. For shhh.flows to do its job, you really need minflows and maxflows to be equal. Otherwise there isn’t much denoising going on.
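As a rough sketch, each orientation would then be run with a single flow count rather than a range, keeping your other settings (450 flows is the Titanium default in the SOP; the right value for your order=B data may differ):

sff.multiple(file=sfffiles.txt, order=B, minflows=450, maxflows=450, pdiffs=5, bdiffs=2, maxhomop=8, minlength=200, flip=F, processors=7)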