mothur

Screen.seqs optimize=start-end makes a weird choice

#1

Hi there,

I’m running a V3V4 dataset with my mothur-based pipeline and during screen.seqs with optimize=start-end. Interestingly, mothur (v1.39.5) makes an odd choice for it, which results in losing all but 1 nt in almost all seqs. See below the details.

I know that it’s an old version but I haven’t seen any mention about this other than force specific start and end positions. The forum examples I saw weren’t as extreme as this one though

mothur > align.seqs(fasta=current, reference=/share/amplicon/otureporter/db/silva.nr_v132.v3v4.align, processors=12, flip=T)
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 12 processors.

Reading in the /share/amplicon/otureporter/db/silva.nr_v132.v3v4.align template sequences...    DONE.
It took 77 to read  213119 sequences.
Aligning sequences from /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.fasta ...
[WARNING]: Some of your sequences generated alignments that eliminated too many bases, a list is provided in /results/amplicon/NGU6358/mot
hur_run/NGU6358.trim.contigs.good.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 3470 secs to align 1748916 sequences.


Output File Names: 
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.align
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.align.report
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.flip.accnos


mothur > summary.seqs(fasta=current, count=current)
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.count_table as input file for the count parameter.
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.align as input file for the fasta parameter.

Using 12 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      10      270     7       0       2       150067
25%-tile:       1423    17011   348     0       4       1500662
Median:         1423    17011   371     0       6       3001324
75%-tile:       1423    17011   371     0       6       4501986
97.5%-tile:     16964   17011   372     0       7       5852581
Maximum:        17011   17012   472     0       8       6002647
Mean:   2080.85 15904.5 324.537 0       5.03819
# of unique seqs:       1748916
total # of seqs:        6002647

Output File Names: 
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.summary

It took 85 secs to summarize 6002647 sequences.

mothur > screen.seqs(fasta=current, count=current, summary=current, optimize=start-end, criteria=95, maxhomop=8)
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.count_table as input file for the count parameter.
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.align as input file for the fasta parameter.
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.summary as input file for the summary parameter.

Using 12 processors.
Optimizing start to 1424.
Optimizing end to 270.

Output File Names: 
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.summary
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.align
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.bad.accnos
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.good.count_table


It took 381 secs to screen 1748916 sequences.

mothur > filter.seqs(fasta=current, vertical=T, trump=.)
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 12 processors.
Creating Filter... 


Running Filter... 



Length of filtered alignment: 0
Number of columns removed: 17012
Length of the original alignment: 17012
Number of sequences used to construct filter: 1628144

Output File Names: 
/results/amplicon/NGU6358/mothur_run/NGU6358.filter
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.filter.fasta


mothur > unique.seqs(fasta=current, count=current)
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.good.count_table as input file for the count parameter.
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.filter.fasta as input file for the fasta parameter.
1628144 1


Output File Names: 
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.filter.count_table
/results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.filter.unique.fasta


mothur > summary.seqs(fasta=current, count=current)
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.filter.count_table as input file for the count parameter.
Using /results/amplicon/NGU6358/mothur_run/NGU6358.trim.contigs.good.unique.good.filter.unique.fasta as input file for the fasta parameter.

Using 12 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      0       0       0       0       1       134233
25%-tile:       0       0       0       0       1       1342330
Median:         0       0       0       0       1       2684660
75%-tile:       0       0       0       0       1       4026989
97.5%-tile:     0       0       0       0       1       5235086
Maximum:        -1      -1      0       0       1       5369318
Mean:   0       0       0       0       1
# of unique seqs:       1
total # of seqs:        5369318
#2

I have never used the optimize option myself, but the choice by itself seems reasonable to me. If I get this correctly, optimize will choose the 95th percentile for start and end (the latter in reverse order) as parameters, which does seem to fit well.

The weird thing is filter.seqs removing all columns, though. Have you run summary.seqs right after screening? Or had a look at the fasta resulting from it? Both might give a hint to the actual problem.

#3
mothur > summary.seqs(fasta=NGU6358.trim.contigs.good.unique.good.align, processors=12, count=NGU6358.trim.contigs.good.good.count_table)

Using 12 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	270	10	0	1	1
2.5%-tile:	1423	17011	346	0	4	134233
25%-tile:	1423	17011	366	0	4	1342330
Median: 	1423	17011	371	0	6	2684660
75%-tile:	1423	17011	371	0	6	4026989
97.5%-tile:	1423	17011	372	0	7	5235086
Maximum:	1424	17012	472	0	8	5369318
Mean:	1406.99	16819.8	361.329	0	5.33067
# of unique seqs:	1628144
total # of seqs:	5369318

Output File Names: 
NGU6358.trim.contigs.good.unique.good.summary

It took 79 secs to summarize 5369318 sequences.

This is the summary after screen.seqs. It looks pretty normal to me.

I also looked into the *.align file and seems ok too.

#4

The optimize parameter is really a throwback to the 454 days and isn’t needed for MiSeq data. Why not use start=1423, end=17011?

Pat

#5

I didn’t know it was really a 454 thing…
I’ll do that. I was using optimize=start-end mostly because it was easier than setting different start and end combinations for different primer sets in our automated pipeline.
I never had this issue processing V1-V3 or V4 data so I was a bit concerned.
Cheers,