More than 90% of sequences lost after screening

Hello,

I had 5520209 sequences after make.contigs but was left with only 336589 sequences after “screen.seqs”. I expect sequence lengths between 370 and 376 bp. I have pasted the summary after each step.

make.contigs(file=stability.files, processors=4)
mothur > summary.seqs(fasta=current)
Using stability.trim.contigs.trim.fasta as input file for the fasta parameter.

Using 1 processors.

             Start  End      NBases   Ambigs      Polymer  NumSeqs
Minimum:     1      51       51       0           2        1
2.5%-tile:   1      120      120      0           3        138006
25%-tile:    1      214      214      1           4        1380053
Median:      1      274      274      2           4        2760105
75%-tile:    1      332      332      10          5        4140157
97.5%-tile:  1      411      411      31          6        5382204
Maximum:     1      498      498      57          249      5520209
Mean:        1      271.079  271.079  6.37958     4.50889

# of Seqs: 5520209


trim.seqs(fasta=stability.trim.contigs.fasta, oligos=primer.oligos, pdiffs=2, flip=T)
mothur > summary.seqs(fasta=current)
Using stability.trim.contigs.trim.fasta as input file for the fasta parameter.

Using 1 processors.

             Start  End      NBases   Ambigs      Polymer  NumSeqs
Minimum:     1      15       15       0           2        1
2.5%-tile:   1      93       93       0           4        75324
25%-tile:    1      189      189      0           4        753238
Median:      1      253      253      3           4        1506475
75%-tile:    1      306      306      8           5        2259712
97.5%-tile:  1      373      373      23          6        2937626
Maximum:     1      459      459      57          85       3012949
Mean:        1      245.66   245.66   5.19473     4.53316

# of Seqs: 3012949


mothur > screen.seqs(fasta=stability.trim.contigs.trim.fasta, group=stability.contigs.pick.groups, minlength=370)
mothur > summary.seqs(fasta=current)
Using stability.trim.contigs.trim.good.fasta as input file for the fasta parameter.

Using 1 processors.

             Start  End      NBases   Ambigs      Polymer  NumSeqs
Minimum:     1      370      370      0           3        1
2.5%-tile:   1      370      370      0           4        8415
25%-tile:    1      371      371      0           4        84148
Median:      1      372      372      0           5        168295
75%-tile:    1      373      373      0           5        252442
97.5%-tile:  1      376      376      0           6        328175
Maximum:     1      459      459      19          10       336589
Mean:        1      372.151  372.151  0.00623609  4.71698

# of Seqs: 336589


It is evident that many sequences are shorter than 370 bp. My question is: what minimum length should I choose to get better results? What would be an "acceptable" limit?

Thanks in advance for your help.
Richa

Hi,

What are you sequencing and with which chemistry? If it’s a 16S region, then I think there are big problems with the data. The quality of your data is quite poor, which is likely making it very difficult to find matches to your barcodes and primers. If you look at the output from make.contigs, you’ll see that at least 75% of your sequences have an ambiguous base call in them. Furthermore, if you expect sequences that are ~370 nt long, then a range between 51 and 498 is way too broad for any 16S region that I know of.

Pat
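
As a rough check on the numbers in the question, here is a minimal Python sketch that tallies how many sequences survive each step. The counts are copied from the summary.seqs outputs quoted above; the step labels and the `retention` helper are illustrative names, not part of mothur.

```python
# Sequence counts copied from the "# of Seqs" lines quoted in this thread.
# The step labels are informal descriptions, not mothur output.
counts = [
    ("make.contigs", 5520209),
    ("trim.seqs", 3012949),
    ("screen.seqs (minlength=370)", 336589),
]


def retention(counts):
    """Return (step, n, pct_of_previous_step, pct_of_first_step) per step."""
    start = counts[0][1]
    prev = start
    rows = []
    for step, n in counts:
        rows.append((step, n, 100.0 * n / prev, 100.0 * n / start))
        prev = n
    return rows


rows = retention(counts)
for step, n, pct_prev, pct_start in rows:
    print(f"{step:<28} {n:>9}  {pct_prev:6.1f}% of previous  {pct_start:6.1f}% of start")
```

With these counts, roughly 55% of the contigs survive trim.seqs and about 6% of the original contigs survive screen.seqs, which is consistent with the "more than 90% lost" in the thread title. Nearly half of the loss happens at trim.seqs, before any length filter is applied.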