screen.seqs after alignment gives problematic results

Hi
I’m working with 61 samples, Illumina, that correspond to 9340562 seqs.
I was able to do all procedures described on http://www.mothur.org/wiki/MiSeq_SOP until I reached the screen.seqs step after alignment.
I already rerun the command several times and I have different results, but in all the number of sequences that remain after screen.seqs command is really really low, despite the summary results looks auspicious. Additionally, even with the start value of 10368, after screening I have sequences starting before, isn’t it odd?

here is a copy of the log file where the summary results are shown and then the results of the screen.seqs.

Thank you for any help

mothur > get.current()

Current files saved by mothur:
fasta=minas_duasamostragens.trim.contigs.good.unique.align
group=minas_duasamostragens.contigs.good.groups
name=minas_duasamostragens.trim.contigs.good.names
count=minas_duasamostragens.trim.contigs.good.count_table
processors=9
summary=minas_duasamostragens.trim.contigs.good.unique.summary

Current working directory: /diag/home/itiago/Minas/

mothur > summary.seqs(fasta=minas_duasamostragens.trim.contigs.good.unique.align, count=minas_duasamostragens.trim.contigs.good.count_table)

Using 9 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 10368 28464 411 0 4 220952
25%-tile: 10369 28464 417 0 4 2209519
Median: 10369 28464 418 0 5 4419037
75%-tile: 10369 28464 419 0 6 6628555
97.5%-tile: 10370 28464 420 1 6 8617121
Maximum: 43116 43116 420 2 9 8838072
Mean: 10383 28458.3 417.348 0.0852903 4.96302

of unique seqs: 7614218

total # of seqs: 8838072

Output File Names:
minas_duasamostragens.trim.contigs.good.unique.summary

It took 4353 secs to summarize 8838072 sequences.

mothur > screen.seqs(fasta=minas_duasamostragens.trim.contigs.good.unique.align, count=minas_duasamostragens.trim.contigs.good.count_table, summary=minas_duasamostragens.trim.contigs.good.unique.summary, start=10368, end=28464, processors=20)

Using 20 processors.

Output File Names:
minas_duasamostragens.trim.contigs.good.unique.good.summary
minas_duasamostragens.trim.contigs.good.unique.good.align
minas_duasamostragens.trim.contigs.good.unique.bad.accnos
minas_duasamostragens.trim.contigs.good.good.count_table


It took 5620 secs to screen 7614218 sequences.

mothur > get.current()

Current files saved by mothur:
fasta=minas_duasamostragens.trim.contigs.good.unique.good.align
group=minas_duasamostragens.contigs.good.groups
name=minas_duasamostragens.trim.contigs.good.names
count=minas_duasamostragens.trim.contigs.good.good.count_table
processors=20
summary=minas_duasamostragens.trim.contigs.good.unique.summary

Current working directory: /diag/home/itiago/Minas/

mothur > summary.seqs(fasta=current, count=current)
Using minas_duasamostragens.trim.contigs.good.good.count_table as input file for the count parameter.
Using minas_duasamostragens.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 20 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 10245 28464 398 0 3 1
2.5%-tile: 10366 28464 412 0 4 43724
25%-tile: 10368 28464 417 0 4 437237
Median: 10368 28464 418 0 5 874473
75%-tile: 10368 28464 419 0 6 1311709
97.5%-tile: 10368 28464 420 1 7 1705221
Maximum: 10368 29665 420 2 9 1748944
Mean: 10367.8 28464 417.45 0.08753 4.97613

of unique seqs: 1502223

total # of seqs: 1748944

Output File Names:
minas_duasamostragens.trim.contigs.good.unique.good.summary

It took 1064 secs to summarize 1748944 sequences.

#Here I noticed the minimum value that did not matched with the value on command, and the dramatic lost of sequences on the process. So I rerun the command…

mothur > screen.seqs(fasta=minas_duasamostragens.trim.contigs.good.unique.align, count=minas_duasamostragens.trim.contigs.good.count_table, summary=minas_duasamostragens.trim.contigs.good.unique.summary, start=10367, end=28464)

Using 20 processors.

Output File Names:
minas_duasamostragens.trim.contigs.good.unique.good.summary
minas_duasamostragens.trim.contigs.good.unique.good.align
minas_duasamostragens.trim.contigs.good.unique.bad.accnos
minas_duasamostragens.trim.contigs.good.good.count_table


It took 4428 secs to screen 7614218 sequences.

mothur > get.current()

Current files saved by mothur:
fasta=minas_duasamostragens.trim.contigs.good.unique.good.align
group=minas_duasamostragens.contigs.good.groups
name=minas_duasamostragens.trim.contigs.good.names
count=minas_duasamostragens.trim.contigs.good.good.count_table
processors=20
summary=minas_duasamostragens.trim.contigs.good.unique.summary

Current working directory: /diag/home/itiago/Minas/

mothur > summary.seqs(fasta=current, count=current)
Using minas_duasamostragens.trim.contigs.good.good.count_table as input file for the count parameter.
Using minas_duasamostragens.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 20 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 10245 28464 404 0 3 1
2.5%-tile: 10366 28464 413 0 4 3719
25%-tile: 10366 28464 417 0 4 37189
Median: 10366 28464 417 0 5 74378
75%-tile: 10366 28464 419 0 6 111566
97.5%-tile: 10366 28464 420 1 7 145036
Maximum: 10367 28478 420 2 9 148754
Mean: 10366 28464 417.39 0.102048 5.0068

of unique seqs: 132441

total # of seqs: 148754

Output File Names:
minas_duasamostragens.trim.contigs.good.unique.good.summary

It took 77 secs to summarize 148754 sequences.

#Again the number of starting point and now the number of sequences is even lower…

Hi Itiago

Try using start=10370 and end=28464 (also, you want maxambig=0). The start position is the position you want your sequences to start at or before and the end is the position you want them to end at or after. Also, I noticed that you are not sequencing the V4 region and that your reads do not fully overlap. This will likely cause you bigger headaches as you proceed. You’ll want to see this:

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Pat

Hi Pat
Thank you for your reply!
I will try start=10370.
Other thing is how could you say that my sequences don’t overlap? They were obtained by using 518F and 926R, and the pair ends had ~245bp each that yield approximately 30bp for overlapping, and the contig.seqs provided ~420bp sequences.
I’m trying to use the V102 Silva.bacteria.fasta to use for alignment, before I used released V119.
And that gives another question, I tried to do your approach to prune the reference file V119 with pcr.seqs on this file and got this
Using 30 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 16979 375 0 4 1
2.5%-tile: 196 18607 461 0 4 3833
25%-tile: 196 18607 466 0 5 38327
Median: 196 18607 472 0 5 76654
75%-tile: 200 18607 477 0 6 114981
97.5%-tile: 239 18607 642 1 7 149475
Maximum: 1890 19127 1539 5 24 153307
Mean: 206.151 18607 484.734 0.0807334 5.26718

of Seqs: 153307

Output File Names:
silva.nr_v119.v4v5.summary

It took 33 secs to summarize 153307 sequences.

so I used the all file that gave the other results on first post.

Best
Igor

Other thing is how could you say that my sequences don’t overlap?

A 30 nt overlap is hardly an overlap. You will get very poor denoising with your primer choice because the reads do not fully overlap. Again, I’d encourage you to read that blog post as well as the Kozich paper for the effects of partially overlapping data.

Pat