Hi,
I am working with published data that I downloaded from NCBI. When I quality-filter using quality scores, I end up with very short sequences (~170 bp). To keep longer sequences I switched to the flowgram route for quality filtering, but even before reaching the alignment portion of the mothur 454 SOP I was seeing only a single unique sequence per sample.
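For clarity, by the flowgram route I mean the shhh.flows-based steps of the 454 SOP, run per sample roughly like this (the filenames and oligos file here are illustrative, not my exact paths):

mothur > sffinfo(sff=SRR606430.sff, flow=T)
mothur > trim.flows(flow=SRR606430.flow, oligos=SRR606430.oligos, pdiffs=2, bdiffs=1)
mothur > shhh.flows(file=SRR606430.flow.files)
mothur > trim.seqs(fasta=SRR606430.shhh.fasta, name=SRR606430.shhh.names, oligos=SRR606430.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T)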
As an example, take samples SRR606430 and SRR606434. After trim.seqs I merged the output files. Here is the summary of the merged files when I use the quality-score filtering strategy:
using: mothur v.1.35.0
mothur > summary.seqs(fasta=MERGED.shhh.trim.unique.fasta, name=MERGED.shhh.trim.unique.names)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 50 50 0 2 1
2.5%-tile: 1 50 50 0 2 246
25%-tile: 1 58 58 0 3 2455
Median: 1 118 118 0 4 4910
75%-tile: 1 176 176 0 5 7364
97.5%-tile: 1 309 309 0 5 9573
Maximum: 1 426 426 0 7 9818
Mean: 1 131.154 131.154 0 3.89927
# of unique seqs: 3867
total # of seqs: 9818
Following is the summary of the merged files after trim.seqs, this time using flowgrams for quality filtering:
mothur > summary.seqs(fasta=Khg.fasta, name=Khg.names)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 270 270 0 4 1
2.5%-tile: 1 270 270 0 4 74
25%-tile: 1 270 270 0 4 740
Median: 1 270 270 0 4 1480
75%-tile: 1 280 280 0 5 2219
97.5%-tile: 1 280 280 0 5 2885
Maximum: 1 280 280 0 5 2958
Mean: 1 274.544 274.544 0 4.45436
# of unique seqs: 2
total # of seqs: 2958
I was wondering why the number of unique sequences is so low when I use flowgrams for quality filtering. In the example above, when I use quality scores instead, about 40 different OTUs remain after all the 454 SOP steps. I suspect that I am doing something wrong, but I just can't figure out what. Any advice would be greatly appreciated.
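In case the merging step matters: I combined the per-sample outputs with merge.files, roughly like this (filenames illustrative):

mothur > merge.files(input=SRR606430.shhh.trim.fasta-SRR606434.shhh.trim.fasta, output=MERGED.shhh.trim.fasta)
mothur > merge.files(input=SRR606430.shhh.trim.names-SRR606434.shhh.trim.names, output=MERGED.shhh.trim.names)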
Thank you,
Pedro