Filtering data based on flowgrams

Hi,

I am working with published data that I downloaded from NCBI. One issue I have run into is that when I use quality scores for quality filtering, I end up with very short sequences (~170 bp). I would like to keep longer sequences, so I switched to the flowgram route for quality filtering, but even before reaching the alignment portion of the mothur 454 SOP I was seeing only a single unique sequence per sample.

As an example, take samples SRR606430 and SRR606434. After trim.seqs, I merged the output files. Here is the summary of the merged file when I use the quality-score filtering strategy:

Using mothur v1.35.0.

mothur > summary.seqs(fasta=MERGED.shhh.trim.unique.fasta, name=MERGED.shhh.trim.unique.names)

Using 2 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	50	50	0	2	1
2.5%-tile:	1	50	50	0	2	246
25%-tile:	1	58	58	0	3	2455
Median:		1	118	118	0	4	4910
75%-tile:	1	176	176	0	5	7364
97.5%-tile:	1	309	309	0	5	9573
Maximum:	1	426	426	0	7	9818
Mean:		1	131.154	131.154	0	3.89927

# of unique seqs: 3867

total # of seqs: 9818

Here is the summary of the merged file after trim.seqs, this time using flowgrams for quality filtering:

mothur > summary.seqs(fasta=Khg.fasta, name=Khg.names)

Using 2 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	270	270	0	4	1
2.5%-tile:	1	270	270	0	4	74
25%-tile:	1	270	270	0	4	740
Median:		1	270	270	0	4	1480
75%-tile:	1	280	280	0	5	2219
97.5%-tile:	1	280	280	0	5	2885
Maximum:	1	280	280	0	5	2958
Mean:		1	274.544	274.544	0	4.45436

# of unique seqs: 2

total # of seqs: 2958

I was wondering why the number of unique sequences is so low when I use flowgrams to quality-filter the data. In the example above, when I use quality scores, about 40 different OTUs remain after all the 454 SOP steps. I suspect that I am doing something wrong, but I just can't figure out what. Any advice would be greatly appreciated.

Thank you,

Pedro

Can you post the exact commands you are running starting with sffinfo for each approach?

Pat

Hi Pat,

Thank you for the quick reply!

For the quality-score route, these are the commands up to the summary shown above:

sffinfo(sff=SRR606430.sff, flow=T)
sffinfo(sff=SRR606434.sff, flow=T)
trim.seqs(fasta=SRR606430.fasta, oligos=oligos.oligos, qfile=SRR606430.qual, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2)
trim.seqs(fasta=SRR606434.fasta, oligos=oligos.oligos, qfile=SRR606434.qual, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2)
merge.files(input=SRR606430.trim.fasta-SRR606434.trim.fasta, output=MERGED.shhh.trim.fasta)
merge.files(input=SRR606430.groups-SRR606434.groups, output=MERGED.groups)
unique.seqs(fasta=MERGED.shhh.trim.fasta)
system(mv MERGED.shhh.trim.names MERGED.shhh.trim.unique.names)
summary.seqs(fasta=MERGED.shhh.trim.unique.fasta, name=MERGED.shhh.trim.unique.names)
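
As a quick sanity check on the merged output, I can also count how many reads each sample retained with something like the following (just a sketch, using the group file produced by merge.files above):

count.groups(group=MERGED.groups)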

For the flowgram route (using output files from sffinfo above):

trim.flows(flow=SRR606430.flow, oligos=oligos.oligos, pdiffs=2, bdiffs=1, processors=2)
trim.flows(flow=SRR606434.flow, oligos=oligos.oligos, pdiffs=2, bdiffs=1, processors=2)
shhh.flows(file=SRR606430.flow.files, processors=2, lookup=./reference)
shhh.flows(file=SRR606434.flow.files, processors=2, lookup=./reference)
trim.seqs(fasta=SRR606430.shhh.fasta, name=SRR606430.shhh.names, oligos=oligos.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, processors=2)
trim.seqs(fasta=SRR606434.shhh.fasta, name=SRR606434.shhh.names, oligos=oligos.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, processors=2)
merge.files(input=SRR606430.shhh.trim.fasta-SRR606434.shhh.trim.fasta, output=Khg.fasta)
merge.files(input=SRR606430.shhh.trim.names-SRR606434.shhh.trim.names, output=Khg.names)
merge.files(input=SRR606430.shhh.groups-SRR606434.shhh.groups, output=Khg.groups)
summary.seqs(fasta=Khg.fasta, name=Khg.names)

For the above, I used the GS FLX Titanium lookup file.
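
In case it matters, this is roughly how I would point shhh.flows at the Titanium lookup file explicitly and then summarize each sample on its own before merging (a sketch only; LookUp_Titanium.pat is the name of the lookup file distributed with mothur, and ./reference is simply the folder where I keep it):

shhh.flows(file=SRR606430.flow.files, lookup=./reference/LookUp_Titanium.pat, processors=2)
summary.seqs(fasta=SRR606430.shhh.fasta, name=SRR606430.shhh.names)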

Pedro

Hi Pat and Mothur Team,

I think I found the culprit for the differences I reported above. I was running the commands above on my university's supercomputer cluster, where mothur is installed. Since the samples I am processing are quite small, I decided to run them on my personal computer to check whether the results could be reproduced. To my surprise, this is the summary I just got, running the exact same samples I am using here as an example (see above):

mothur > summary.seqs(fasta=Khg.shhh.trim.fasta, name=Khg.shhh.trim.names)

Using 2 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	252	252	0	3	1
2.5%-tile:	1	260	260	0	4	74
25%-tile:	1	266	266	0	5	740
Median:		1	271	271	0	5	1480
75%-tile:	1	275	275	0	5	2219
97.5%-tile:	1	282	282	0	5	2885
Maximum:	1	295	295	0	6	2958
Mean:		1	270.823	270.823	0	4.75896

# of unique seqs: 963

total # of seqs: 2958

Output File Names:
Khg.shhh.trim.summary

It took 0 secs to summarize 2958 sequences.

Although the number of sequences kept is still quite small compared to the alternative route (the quality-score route kept close to 3 times more sequences), at least the number of unique sequences is far more reasonable than the 2 unique sequences I was getting before when running these samples on the supercomputer. I'm still not sure why mothur is not working properly on the supercomputer.

In any case, thank you for your help!