mothur

filter.seqs gives the error: Sequences are not all the same length, please correct.

Hi,

I’m getting the below error when running filter.seqs with vertical=T and trump=.
I’ve looked around on the forum, but I haven’t found any solution. I’ve tried using e.g. processors=1.
I used the silva.bacteria.fasta as reference alignment in align.seqs prior to screen.seqs.

Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
[ERROR]: Could not open /home/jbjork/496_515_seqs_2Bjork.small.pick.pick.unique.good.align28602filterValues.temp

So, there seem to be 4 sequences that aren’t the same length.

Before running filter.seqs, I ran screen.seqs with start=13862, end=21287 and minlength=125. Running summary.seqs on the result gave:

Using 1 processors.

             Start   End      NBases  Ambigs  Polymer  NumSeqs
Minimum:     11890   21287    125     0       3        1
2.5%-tile:   13862   21287    125     0       3        15480
25%-tile:    13862   21287    125     0       4        154796
Median:      13862   21287    125     0       4        309591
75%-tile:    13862   21287    125     0       5        464386
97.5%-tile:  13862   21293    125     0       6        603702
Maximum:     13862   28112    186     0       15       619181
Mean:        13862   21287.3  125     0       4.28841
# of Seqs:   619181

update

Even forcing screen.seqs to keep only sequences of the same length, by setting both minlength and maxlength to 125, led to the same error in filter.seqs. I basically ended up with the same set of sequences.

Using 1 processors.

             Start   End      NBases  Ambigs  Polymer  NumSeqs
Minimum:     11890   21287    125     0       3        1
2.5%-tile:   13862   21287    125     0       3        15480
25%-tile:    13862   21287    125     0       4        154796
Median:      13862   21287    125     0       4        309591
75%-tile:    13862   21287    125     0       5        464386
97.5%-tile:  13862   21293    125     0       6        603701
Maximum:     13862   21341    125     0       15       619180
Mean:        13862   21287.2  125     0       4.28841
# of Seqs:   619180

Or using optimize=start-end

Using 1 processors.

             Start   End      NBases  Ambigs  Polymer  NumSeqs
Minimum:     11890   21284    119     0       3        1
2.5%-tile:   13862   21284    125     0       3        16271
25%-tile:    13862   21287    125     0       4        162706
Median:      13862   21287    125     0       4        325412
75%-tile:    13862   21287    125     0       5        488117
97.5%-tile:  13862   21293    125     0       6        634552
Maximum:     13862   28112    186     0       15       650822
Mean:        13862   21287.1  125     0       4.30316
# of Seqs:   650822

Results in filter.seqs error

Using 1 processors.
Creating Filter... 
Sequences are not all the same length, please correct.

Can you see where the problem lies?

Any help on this would be much appreciated, thanks.

We have this issue on the list to resolve in the next release. If you run the filter.seqs command in debug mode, you should be able to find the few sequences that are causing the issue.

mothur > set.dir(debug=t)
mothur > filter.seqs(…)

Hi Sarah, thanks for the reply.

I tried it out on the command line (not on the cluster), but it only gave me the information I already knew: Sequences are not all the same length, please correct.

How can I find out which sequences are causing the problem?

Many thanks,
J

Since the error indicates that some sequences are either too short or too long, I extracted those using awk

awk '$4 < 125 {print $1}' 496_515_seqs_2Bjork.small.pick.pick.unique.summary > bad.seqs.accnos
awk '$4 > 125 {print $1}' 496_515_seqs_2Bjork.small.pick.pick.unique.summary >> bad.seqs.accnos

and removed those from the alignment
remove.seqs(fasta=./496_515_seqs_2Bjork.small.pick.pick.unique.align, accnos=./bad.seqs.accnos)

then screened using start, end and minlength
screen.seqs(fasta=./496_515_seqs_2Bjork.small.pick.pick.unique.align.pick, start=13862, end=21287, minlength=125, maxlength=125)

However, this still caused the error in filter.seqs, even though

awk '$4 < 125 {print $1}' 496_515_seqs_2Bjork.small.pick.pick.unique.align.pick.good.summary | wc -l
0
awk '$4 > 125 {print $1}' 496_515_seqs_2Bjork.small.pick.pick.unique.align.pick.good.summary | wc -l
0

indicating that all sequences are 125bp.

total # seqs with 125bp
awk '$4 == 125 {print $1}' 496_515_seqs_2Bjork.small.pick.pick.unique.align.pick.good.summary | wc -l
619179

I was thinking of filtering the sequences using sed, removing the - and . characters, but this will obviously mess up the alignment.

Any ideas on how I can proceed are highly appreciated.
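
A note on the awk checks above: the NBases column of the summary counts bases only, whereas filter.seqs compares the aligned length of each sequence, i.e. the number of alignment columns including the - and . characters. So every sequence can have 125 bases while a few still span a different number of columns. Below is a minimal Python sketch that checks the aligned length directly from the alignment file. It is not part of mothur; the function names are made up for illustration, and it assumes a plain FASTA alignment in which the sequence name is the first token after ">".

```python
from collections import Counter


def aligned_lengths(lines):
    """Yield (name, aligned_length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the sequence name is the first whitespace-delimited token.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            # Count every column, including the gap characters '-' and '.'.
            length += len(line)
    if name is not None:
        yield name, length


def find_outliers(lines):
    """Return (most common aligned length, [(name, length), ...] that differ from it)."""
    records = list(aligned_lengths(lines))
    if not records:
        return None, []
    expected = Counter(length for _, length in records).most_common(1)[0][0]
    return expected, [(n, l) for n, l in records if l != expected]


def write_accnos(outliers, path):
    """Write one sequence name per line, the layout used for accnos files."""
    with open(path, "w") as handle:
        for name, _ in outliers:
            handle.write(name + "\n")
```

To use it, open the alignment file and pass the handle to find_outliers, then hand the result to write_accnos. The resulting file could be given to remove.seqs in the same way as bad.seqs.accnos above.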

With the debug flag set, you should see output like:

[DEBUG]: yourSequenceName length = yourSequencesLength
[DEBUG]: yourSequenceName length = yourSequencesLength
[DEBUG]: yourSequenceName length = yourSequencesLength

or

[DEBUG]: 1_F003D000 length = 235

This information should be in the log file as well. You can search the log file to find the sequences that are not the same length as all the others.
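
Searching the log by eye is tedious with several hundred thousand sequences, so here is a rough Python sketch that does it. The line pattern is taken only from the example lines above ("[DEBUG]: name length = N"); other mothur versions may log differently, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the example in this thread, e.g.
# "[DEBUG]: 1_F003D000 length = 235"; adjust it if your log looks different.
DEBUG_LINE = re.compile(r"^\[DEBUG\]:\s+(\S+)\s+length\s+=\s+(\d+)\s*$")


def odd_lengths_from_log(lines):
    """Return (most common length, {name: length}) for sequences that differ from it."""
    lengths = {}
    for line in lines:
        match = DEBUG_LINE.match(line.strip())
        if match:
            lengths[match.group(1)] = int(match.group(2))
    if not lengths:
        # No debug lines found, e.g. the mothur version did not print them.
        return None, {}
    expected = Counter(lengths.values()).most_common(1)[0][0]
    return expected, {n: l for n, l in lengths.items() if l != expected}
```

Pass it an open handle to the mothur log file. If it returns (None, {}), the log contains no debug lines at all, which is worth knowing before looking any further.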

Thanks for replying.

No, I do not get any additional information.

Output in window

mothur > set.dir(debug=T)

Setting [DEBUG] flag.

mothur > filter.seqs(fasta=./496_515_seqs_2Bjork.small.pick.pick.unique.align.pick.good, vertical=T, trump=.)

Using 1 processors.
Creating Filter…
100
200
300
.
.
.
Sequences are not all the same length, please correct.
Segmentation fault

Output in logfile

mothur > set.dir(debug=T)
Setting [DEBUG] flag.

mothur > filter.seqs(fasta=./496_515_seqs_2Bjork.small.pick.pick.unique.align.pick.good, vertical=T, trump=.)

Using 1 processors.
Creating Filter…
Sequences are not all the same length, please correct.

I’m using mothur v.1.31.1

Many thanks for helping me solve this.

set.dir(debug=t) works as you said in a newer version of mothur. Should obviously have upgraded earlier sigh

Are there any updates on when the next version will come out, fixing the above bug? I have run into the same problem.

Thank you so much and Happy New Year!

Katherine

Katherine,

Are you using 1.36? Can you post the output of using the debug mode as Sarah outlined above?

Pat

I am using the current version of mothur (1.43.0) and am having the same issue.

mothur > filter.seqs(fasta=second.m.s.u.g.align, vertical=T, trump=., processors=8)

Using 8 processors.
Creating Filter…
[ERROR]: Sequences are not all the same length, please correct.
20
19
18
8
19
12
20
10
It took 0 secs to create filter for 126 sequences.

It took 0 seconds to run 1 commands from your script.

I used raw files without trimming. I sequenced the V3-V4 region, and here is the summary from the alignment step:

mothur > summary.seqs(fasta=second.m.s.u.align, count=second.m.s.count_table, processors=2)

Using 2 processors.

             Start   End     NBases  Ambigs  Polymer  NumSeqs
Minimum:     1046    1047    1       0       1        1
2.5%-tile:   6388    25316   440     0       4        77548
25%-tile:    6388    25316   441     0       5        775472
Median:      6388    25316   449     0       6        1550943
75%-tile:    6388    25316   464     0       6        2326414
97.5%-tile:  6389    25316   466     0       8        3024338
Maximum:     43116   43116   475     0       79       3101885
Mean:        6408    25306   451     0       5
# of unique seqs: 2098941
total # of seqs: 3101885

What might be the problem?

You need to run screen.seqs(fasta=second.m.s.u.align, start=6388, end=25316) before you run filter.seqs.

I ran screen.seqs as
mothur "#screen.seqs(fasta=second.m.s.u.align, count=second.m.s.count_table, summary=second.m.s.u.summary, start=6388, end=25316, maxhomop=8, processors=8)"

From that I got the following output:
Running command: remove.seqs(accnos=second.m.s.u.bad.accnos.temp, count=second.m.s.count_table)
Removed 118982 sequences from your count file.

Output File Names:
second.m.s.pick.count_table

/******************************************/

Output File Names:
second.m.s.u.good.summary
second.m.s.u.good.align
second.m.s.u.bad.accnos
second.m.s.good.count_table

Then, for convenience, I renamed good to g using the command:
rename good g good
Finally, I ran filter.seqs
mothur "#filter.seqs(fasta=second.m.s.u.g.align, vertical=T, trump=., processors=8)"
and got the error shown above.