filter.seqs - potential bug?

Hi,
I am running version v.1.31.2 (64-bit compilation) on a Linux machine, trying to finish processing FASTA files produced by a MiSeq run. I can successfully align and screen the sequences. The resulting aligned sequences all appear to be the same length (based on summary.seqs), but when I try to remove the gaps and uninformative regions with filter.seqs, I get an error about sequence lengths.

FYI, the summary.seqs command and output are:
mothur > summary.seqs(fasta=AB_23.trim.unique.good.align, processors=10)

Using 10 processors.

             Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:     0        0        0        0          1        1
2.5%-tile:   13862    26158    300      0          4        2842
25%-tile:    13862    26160    300      0          5        28415
Median:      13862    26160    300      0          6        56830
75%-tile:    13862    26162    300      0          7        85244
97.5%-tile:  13862    26165    300      0          9        110817
Maximum:     13862    26176    300      78         44       113658
Mean:        13861.4  26159.7  299.986  0.0154147  6.3702

# of Seqs: 113658

While the filter.seqs command and output are:

mothur > filter.seqs(fasta=AB_23.trim.unique.good.align, vertical=T, trump=., processors=10)

Using 10 processors.
Creating Filter…
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.
Sequences are not all the same length, please correct.

I also tried running the same command with 1 processor, but I get the same sequence-length error.

Thanks for your help!

The summary.seqs results look like you have a sequence with no bases, “Minimum: 0 0 0 0 1 1”. Something odd seems to be going on. Can you post the summary.seqs results after align.seqs before screen.seqs?
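One way to track down the sequence with no bases is to scan the aligned FASTA file directly. The following is only a rough sketch, not part of mothur: the helper names (`read_fasta`, `count_bases`, `find_empty`) are made up for illustration, and it assumes that `.` and `-` are the only non-base characters in the alignment.

```python
import io

GAP_CHARS = set(".-")  # assumption: only '.' and '-' mark gaps/padding


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].strip()
            # Keep only the first token of the header as the name.
            name = header.split()[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_bases(seq):
    """Number of characters in seq that are not gap characters."""
    return sum(1 for ch in seq if ch not in GAP_CHARS)


def find_empty(handle):
    """Names of all records that contain no bases at all."""
    return [name for name, seq in read_fasta(handle) if count_bases(seq) == 0]


if __name__ == "__main__":
    demo = io.StringIO(">seqA\n..AC-GT..\n>seqB\n.........\n>seqC\n\n")
    print(find_empty(demo))  # ['seqB', 'seqC']
```

On a real file you would pass `open("your_file.align")` instead of the in-memory demo; the names it prints are candidates for the "Minimum: 0 0 0" row.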

Hello and thanks for your quick reply. I am attaching the summary.seqs results you requested (before screen.seqs). I should also mention that in the process of running filter.seqs, the output created (filter.summary) lists sequences as being 299 bp long when summary.seqs suggests they aren’t.

mothur > summary.seqs(fasta=AB-23.trim.unique.align, processors=10)


Using 10 processors.

             Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:     0        0        0        0          1        1
2.5%-tile:   13862    26152    300      0          4        28693
25%-tile:    13862    26158    300      0          5        286922
Median:      13862    26160    300      0          6        573843
75%-tile:    13862    26162    300      0          7        860764
97.5%-tile:  13862    26167    300      0          9        1118993
Maximum:     43116    43116    300      89         67       1147685
Mean:        14046.6  26100.6  294.176  0.0147497  6.22842

# of Seqs: 1147685

Looks like the bad sequence came from the aligning step. Did you get any warnings from mothur when you aligned? Did you set flip=t in the align.seqs command? When flip=t, align.seqs will align the forward and reverse of the sequence and then choose the better alignment. Did you use a minlength value for screen.seqs? Could you set minlength=200 in screen.seqs and then post the resulting summary.seqs and filter.seqs output?

Hi again,
Thanks for the suggestion. I went ahead and re-ran the alignment using flip=t and re-ran the screen.seqs command with minlength=200. Here are the results of the summary.seqs on that output:

mothur > summary.seqs(fasta=AB-23.trim.unique.good.align, processors=10)

Using 10 processors.

             Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:     0        0        0        0          1        1
2.5%-tile:   13862    26158    300      0          4        5171
25%-tile:    13862    26160    300      0          5        51705
Median:      13862    26160    300      0          6        103409
75%-tile:    13862    26162    300      0          7        155113
97.5%-tile:  13862    26165    300      0          9        201646
Maximum:     13862    26388    300      78         44       206816
Mean:        13861.5  26159.9  299.988  0.0119865  6.39287

# of Seqs: 206816

You can clearly see there is still one sequence with 0 bp. However, I don’t think that’s the problem. When I then proceed with the filter.seqs command, I still get the error that the sequences are not all the same length, and the command fails. But the output it creates (filter.summary) lists sequences with both 300 and 299 bases. For example, see this excerpt:

M00532_31_000000000-A1L3U_1_1102_17117_27667 1 432 300 0 5 1
M00532_31_000000000-A1L3U_1_1102_16217_28323 1 432 300 0 6 1
M00532_31_000000000-A1L3U_1_1103_16256_3626 1 432 300 0 8 1
M00532_31_000000000-A1L3U_1_1103_11278_3736 1 432 300 0 6 1
M00532_31_000000000-A1L3U_1_1103_21661_5537 1 432 300 0 8 1
M00532_31_000000000-A1L3U_1_1103_9665_6775 1 432 300 0 5 1
M00532_31_000000000-A1L3U_1_1103_13361_7512 1 432 299 0 6 1
M00532_31_000000000-A1L3U_1_1103_11601_8576 1 432 300 0 6 1
M00532_31_000000000-A1L3U_1_1103_11710_10120 1 432 300 0 5 1
M00532_31_000000000-A1L3U_1_1103_26564_10217 1 432 300 0 7 1
M00532_31_000000000-A1L3U_1_1103_23471_11484 1 432 300 0 6 1
M00532_31_000000000-A1L3U_1_1103_26291_11540 1 432 300 0 6

The 299 vs 300 is the number of bases in the sequence, not necessarily the aligned length. The aligned length includes the gaps. Could you send your aligned fasta file to mothur.bugs@gmail.com, so I can try to reproduce and troubleshoot the issue?
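To make the distinction concrete, here is a small sketch (not mothur code) that reports both numbers for each sequence and flags any sequence whose aligned length differs from the most common one. The function names are hypothetical, and treating `.` and `-` as the only gap characters is an assumption.

```python
from collections import Counter


def n_bases(seq):
    """Count the bases in seq, ignoring '.' and '-' (assumed gap characters)."""
    return sum(1 for ch in seq if ch not in ".-")


def length_outliers(records):
    """records maps name -> aligned sequence.

    Returns (most common aligned length, {name: length} for every
    sequence whose aligned length differs from it).
    """
    counts = Counter(len(seq) for seq in records.values())
    if not counts:
        return None, {}
    expected = counts.most_common(1)[0][0]
    odd = {name: len(seq) for name, seq in records.items() if len(seq) != expected}
    return expected, odd


if __name__ == "__main__":
    records = {
        "s1": "..ACGT-A..",  # aligned length 10, 5 bases
        "s2": "..AC-T-A..",  # aligned length 10, 4 bases
        "s3": "..ACGT",      # aligned length 6, 4 bases
    }
    for name, seq in records.items():
        print(name, len(seq), n_bases(seq))
    print(length_outliers(records))  # (10, {'s3': 6})
```

In this toy data s1 and s2 have different base counts but the same aligned length, so only s3 is reported as an outlier.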

Hi,

Was there any resolution to this? I’m running into the same issue. It didn’t seem to be a problem when I was following the 454 SOP, where the End values differed but you could still carry on with filter.seqs.

Here’s my output after align.seqs and screen.seqs:

mothur > summary.seqs(fasta=allenv.shhh.cat.unique.kmer6.good.align, name=allenv.shhh.cat.good.names)


Using 2 processors.

             Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:     1        222      205      0          3        1
2.5%-tile:   1        222      211      0          4        33736
25%-tile:    1        222      213      0          4        337357
Median:      1        222      216      0          4        674714
75%-tile:    1        222      217      0          5        1012070
97.5%-tile:  1        222      220      0          5        1315691
Maximum:     1        231      231      0          8        1349426
Mean:        1        222.001  215.358  0          4.34434

# of unique seqs: 120158

total # of seqs: 1349426

Output File Names:
allenv.shhh.cat.unique.kmer6.good.summary

In the past (as a test), I’ve tried a second screen.seqs and/or trim.seqs, but to no avail. I’m also thinking I shouldn’t be using trim.seqs, since it trims the whole sequence and I end up with different End values rather than the same End values as above (except for the Max).

Thanks!

holly

i’m not sure there is a problem - what is the sign that there is a problem?

after running filter.seqs you can have sequences start/end at slightly different positions. this happens because there was a base up/downstream of that position, and the base at the new starting position is missing.
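A toy illustration of that effect, written as a simplified sketch rather than mothur's actual implementation: it only drops columns where every sequence has a gap (roughly what vertical=T is described as doing) and ignores trump and all other options. The function names and the 1-based start/end convention are assumptions for the example.

```python
GAPS = ".-"  # assumption: these are the only gap characters


def vertical_filter(seqs):
    """Remove every column in which all sequences have a gap character."""
    if not seqs:
        return []
    n_cols = len(seqs[0])
    if any(len(s) != n_cols for s in seqs):
        raise ValueError("Sequences are not all the same length")
    keep = [i for i in range(n_cols) if any(s[i] not in GAPS for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


def start_end(seq):
    """1-based positions of the first and last base, or (0, 0) if there are none."""
    positions = [i + 1 for i, ch in enumerate(seq) if ch not in GAPS]
    return (positions[0], positions[-1]) if positions else (0, 0)


if __name__ == "__main__":
    aligned = ["..ACG-T..", "...CGAT..", "..ACG-TA."]
    filtered = vertical_filter(aligned)
    print(filtered)                          # ['ACG-T.', '.CGAT.', 'ACG-TA']
    print([start_end(s) for s in filtered])  # [(1, 5), (2, 5), (1, 6)]
```

All three filtered sequences have the same aligned length, yet the second one starts at position 2 and the third ends at position 6, because the other sequences have no base in those columns.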

pat