strange behaviour align.seqs with samples from same run

Hi,

I have found a problem with some samples from the last pyrosequencing run (all samples were ampliefied with same set of primers). I have followed the 454 sop to get the fasta, names and groups files per sample (allfiles=T in the trim.seqs command). Then I merged fasta, names and groups files according to different experiments (bagazo and purines2), and when I started to process the files (unique.seqs, then align.seqs with silva.bacteria.fasta) I found that after align.seqs samples from purines2 experiment give some strange results as compared with samples from bagazo experiment. I paste below the summary.seqs for the starting fasta, names and groups files and align.seqs for the samples of two experiments. In both experiments sequences are of at least 200 bp but after align.seqs in purines2 I found that too many sequences will be lost because in the aligment are below 200bp. After looking through the forum I could not find a response and I do not know the reason for these results.
I have uploaded to dropbox the files ( https://www.dropbox.com/sh/iz6gz1loj4z4hsf/AACU_AHqyp1ztoOY7TkTKkI1a?dl=0)

summary.seqs(fasta=bagazo.fasta, name=bagazo.names, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 2 1
2.5%-tile: 1 201 201 0 4 1671
25%-tile: 1 212 212 0 4 16704
Median: 1 218 218 0 4 33407
75%-tile: 1 221 221 0 5 50110
97.5%-tile: 1 238 238 0 6 65142
Maximum: 1 459 459 0 8 66812
Mean: 1 218.617 218.617 0 4.30652

of unique seqs: 17854

total # of seqs: 66812

unique.seqs(fasta=bagazo.fasta, name=bagazo.names)
align.seqs(fasta=bagazo.unique.fasta, reference=silva.bacteria.fasta, processors=8)
summary.seqs(fasta=bagazo.unique.align, name=bagazo.unique.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1076 4 0 1 1671
25%-tile: 5456 13125 211 0 4 16704
Median: 6091 13125 218 0 4 33407
75%-tile: 6099 13125 220 0 5 50110
97.5%-tile: 43096 43116 237 0 6 65142
Maximum: 43116 43116 326 0 8 66812
Mean: 7015.83 13799.9 205.111 0 4.15398

of unique seqs: 13514

total # of seqs: 66812

Output File Names:
bagazo.unique.summary


summary.seqs(fasta=purines2.fasta, name=purines2.names, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 2 1
2.5%-tile: 1 201 201 0 3 2492
25%-tile: 1 210 210 0 4 24912
Median: 1 218 218 0 4 49823
75%-tile: 1 227 227 0 5 74734
97.5%-tile: 1 246 246 0 6 97154
Maximum: 1 456 456 0 8 99645
Mean: 1 219.047 219.047 0 4.2945

of unique seqs: 39171

total # of seqs: 99645

Output File Names:
purines2.summary


unique.seqs(fasta=purines2.fasta, name=purines2.names) align.seqs(fasta=purines2.unique.fasta, reference=silva.bacteria.fasta, processors=8) summary.seqs(fasta=purines2.unique.align, name=purines2.unique.names)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 2492
25%-tile: 1044 1106 4 0 1 24912
Median: 6091 13125 14 0 3 49823
75%-tile: 43096 43116 213 0 4 74734
97.5%-tile: 43115 43116 235 0 6 97154
Maximum: 43116 43116 332 0 8 99645
Mean: 17652.5 20364.8 86.2844 0 2.87403

of unique seqs: 33653

total # of seqs: 99645

Output File Names:
purines2.unique.summary

Can you tell us more about how the data were generated? You seem to have very short reads in general for not doing any quality trimming. Were both libraries sequenced int he same direction? If not, you may have some problems on your hands. You could try using flip=T when you run align.seqs.

pat

All samples came from the same run (plate) and came in the same sff file that was processed following 454 sop protocol (but with order=B). Anyway I have asked to sequencer facility how the libraries were done.
I have tried filp=T with align.seqs and found the same problem:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1051 2 0 1 2492
25%-tile: 1044 1122 9 0 2 24912
Median: 6091 13125 20 0 3 49823
75%-tile: 43055 43116 213 0 4 74734
97.5%-tile: 43115 43116 235 0 6 97154
Maximum: 43116 43116 332 0 8 99645
Mean: 17768.8 20494.1 89.5823 0 3.16482

of unique seqs: 33653

total # of seqs: 99645

I do not understand why should I have very short reads. Before doing align.seqs, the summary of the fasta (from merge samples after trim.seqs) file shows that all sequences are at least 200 bp long:

tart End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 2 1
2.5%-tile: 1 201 201 0 3 2492
25%-tile: 1 210 210 0 4 24912
Median: 1 218 218 0 4 49823
75%-tile: 1 227 227 0 5 74734
97.5%-tile: 1 246 246 0 6 97154
Maximum: 1 456 456 0 8 99645
Mean: 1 219.047 219.047 0 4.2945

of unique seqs: 39171

total # of seqs: 99645

In fact, after shhh.flows I found that most of sequences are at least 200bp long:

summary.seqs(fasta=IZ1TX7302.shhh.fasta, name=IZ1TX7302.shhh.names)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 64 64 0 2 1
2.5%-tile: 1 218 218 0 3 9893
25%-tile: 1 236 236 0 4 98923
Median: 1 245 245 0 4 197846
75%-tile: 1 252 252 0 5 296768
97.5%-tile: 1 275 275 0 6 385798
Maximum: 1 504 504 0 14 395690
Mean: 1 244.198 244.198 0 4.1886

of unique seqs: 159433

total # of seqs: 395690

Output File Names:
IZ1TX7302.shhh.summary

Output File Names:
purines2.unique.summary

Hope this helps

Mi provider told me that all samples were sequenced in the same direction, with the same primers and in the same run

If you look in the align report file you can see the issue is a poor match in the reference. For example:

IZ1TX7302GQ8VY
CAATCCAGCCAGCAAAATGAATTTGAGCAATCGCATCGAACTTCGGAAATCCTTTCGTCACCATCCCAGTTACAGGCCGAAACCGGGCGACGAAGGGCGGGCAGATTGCCCCGGTGTGAAAAAATCACAAGGCAATTTTTTGCTTTTTTGGGTTAGGTGAAATTGTGAAGCAGCAGTTATGCAAAGGTTAAGCTGGACACCGTTTGTCCCGATTTAACTGCCGGAGTCTCGT

QueryName IZ1TX7302GQ8VY
QueryLength 232
TemplateName AY444980.1
TemplateLength 1411
SearchMethod kmer
SearchScore 9.33
AlignmentMethod needleman
QueryStart 232
QueryEnd 232
TemplateStart 1
TemplateEnd 1
PairwiseAlignmentLength 1
GapsInQuery 0
GapsInTemplate 0
longestInsert 0
SimBtwnQuery&Template 50.00

This causes a poor alignment.

Could matching be improved using the 7.2 Gb version of silva reference file? How much RAM should I have to run align.seqs with this reference file? My computer has 16Gb of RAM.

Thank you very much for your help