filter.seqs removes all data

When I run the following command:
filter.seqs(fasta=H51WFGV01.shhh.trim.unique.good.align, vertical=T, trump=., processors=8)

I get the following output:
Length of filtered alignment: 0
Number of columns removed: 50000
Length of the original alignment: 50000
Number of sequences used to construct filter: 5089

The input align file looks OK. Any ideas on what’s happening?

Are your sequences overlapping? Could you post the results of running: summary.seqs(fasta=H51WFGV01.shhh.trim.unique.good.align)?
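
A quick way to answer the overlap question from the Start and End columns is sketched below in Python. The function name and the coordinate pairs are hypothetical, chosen only for illustration. The idea: with trump=., only columns covered by every sequence can survive, so the usable window runs from the largest Start to the smallest End.

```python
def shared_window(coords):
    """coords: list of (start, end) alignment positions, one pair per sequence.

    Returns the (start, end) window covered by every sequence, or None
    if no single column is covered by all of them.
    """
    left = max(start for start, _ in coords)   # latest start among all sequences
    right = min(end for _, end in coords)      # earliest end among all sequences
    return (left, right) if left <= right else None


# Hypothetical coordinates, for illustration only.
print(shared_window([(5256, 13862), (5279, 13862), (5310, 14979)]))
print(shared_window([(5256, 13862), (42993, 43116)]))
```

If the second call describes your data (some reads end before others start), there is no shared window and a trump filter has nothing left to keep.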

Summary.seqs gives:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 3855 13862 23 0 2 1
2.5%-tile: 5235 13862 32 0 3 128
25%-tile: 5256 13862 279 0 4 1273
Median: 5279 13862 286 0 5 2545
75%-tile: 5310 13862 290 0 5 3817
97.5%-tile: 42993 43116 296 0 6 4962
Maximum: 43017 43116 324 0 8 5089
Mean: 9629.31 17273.4 258.737 0 4.62566

# of Seqs: 5089

Do you need the file that it output as well?

Your sequences are not overlapping: more than 75% of them end (at position 13862) before the remainder start (around position 42993). What parameter options did you run with the screen.seqs command? Can you try screen.seqs(fasta=yourFasta, name=yourName, group=yourGroup, optimize=start) and post the results of the summary.seqs command again?
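
To see why non-overlapping sequences leave an alignment of length 0, here is a minimal Python sketch of the two filters as I understand them: trump=. drops a column if any sequence has a '.' there, and vertical=T drops a column that holds only gap characters. This is not mothur's code; the function name and the toy alignments are made up for illustration.

```python
def filter_columns(seqs, vertical=True, trump="."):
    """Return the indices of alignment columns that survive the filter."""
    length = len(seqs[0])
    keep = []
    for col in range(length):
        column = [s[col] for s in seqs]
        if trump is not None and trump in column:
            continue  # trump: drop the column if any sequence has the trump character
        if vertical and all(c in "-." for c in column):
            continue  # vertical: drop the column if it holds only gap characters
        keep.append(col)
    return keep


# Toy alignments: '.' marks positions outside a read, '-' an internal gap.
overlapping = [
    "..ACG-TA..",
    ".GACGTTA..",
    "..ACG-TAC.",
]
disjoint = [
    "ACGT......",
    "......TTGA",
]

print(filter_columns(overlapping))  # [2, 3, 4, 5, 6, 7]
print(filter_columns(disjoint))     # []
```

In the disjoint case every column has a '.' in at least one read, so the trump filter removes all of them, which matches "Number of columns removed: 50000".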

For screen.seqs my command was:

screen.seqs(fasta=H51WFGV01.shhh.trim.unique.align, name=H51WFGV01.shhh.trim.unique.names, group=H51WFGV01.shhh.groups, end=43116, optimize=start, criteria=95, processors=8)

When I ran using the parameters you recommended I get the following:

mothur > summary.seqs(fasta=H51WFGV01.shhh.trim.unique.good.align, name=H51WFGV01.shhh.trim.unique.good.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1046 2 0 1 1
2.5%-tile: 1044 1130 29 0 3 1104
25%-tile: 5256 13862 281 0 4 11038
Median: 5256 13862 286 0 5 22075
75%-tile: 5290 13862 289 0 5 33112
97.5%-tile: 5327 13862 293 0 6 43045
Maximum: 5347 14979 324 0 7 44148
Mean: 5039.86 13171.5 272.155 0 4.65434

# of unique seqs: 5173

total # of seqs: 44148

Want to try screen.seqs(fasta=H51WFGV01.shhh.trim.unique.align, name=H51WFGV01.shhh.trim.unique.names, group=H51WFGV01.shhh.groups, optimize=start-end, processors=8)? Can you also post the summary.seqs results on the H51WFGV01.shhh.trim.unique.fasta file?

screen.seqs(fasta=H51WFGV01.shhh.trim.unique.align, name=H51WFGV01.shhh.trim.unique.names, group=H51WFGV01.shhh.groups, optimize=start-end, processors=2)

summary.seqs(fasta=H51WFGV01.shhh.trim.unique.good.align, name=H51WFGV01.shhh.trim.unique.good.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 3855 13862 246 0 3 1
2.5%-tile: 5242 13862 273 0 4 1039
25%-tile: 5256 13862 282 0 4 10383
Median: 5256 13862 286 0 5 20765
75%-tile: 5290 13862 289 0 5 31147
97.5%-tile: 5327 13862 293 0 6 40490
Maximum: 5347 14979 324 0 7 41528
Mean: 5272.09 13859.9 285.634 0 4.72698

# of unique seqs: 4450

total # of seqs: 41528

Output File Name:
H51WFGV01.shhh.trim.unique.good.align.summary

Can you also post the summary.seqs results on the H51WFGV01.shhh.trim.unique.fasta file?

summary.seqs(fasta=H51WFGV01.shhh.trim.unique.fasta, name=H51WFGV01.shhh.trim.unique.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 231 231 0 3 1
2.5%-tile: 1 263 263 0 4 1225
25%-tile: 1 281 281 0 4 12244
Median: 1 286 286 0 5 24487
75%-tile: 1 289 289 0 5 36730
97.5%-tile: 1 293 293 0 6 47748
Maximum: 1 324 324 0 8 48972
Mean: 1 283.286 283.286 0 4.75125

# of unique seqs: 6333

total # of seqs: 48972

Output File Name:
H51WFGV01.shhh.trim.unique.fasta.summary

With optimize=start-end the sequences now overlap, so filter.seqs should no longer filter out the entire alignment. I asked for the wrong file in my last post; I meant the summary.seqs output for fasta=H51WFGV01.shhh.trim.unique.align. Looking at those results may help us pick better start and end values.
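
For picking start and end values, the sketch below shows one way to read a percentile cutoff off the Start and End columns. It is an approximation of what I understand optimize=start-end with criteria=95 to do, not mothur's actual implementation; the function names and the toy coordinates are hypothetical.

```python
def percentile_cutoffs(starts, ends, criteria=95):
    """Pick cutoffs so that at least `criteria` percent of reads pass each one."""
    n = len(starts)
    k = (n * criteria + 99) // 100                  # ceil(n * criteria / 100)
    start_cutoff = sorted(starts)[k - 1]            # k reads start at or before this
    end_cutoff = sorted(ends, reverse=True)[k - 1]  # k reads end at or after this
    return start_cutoff, end_cutoff


def screen(starts, ends, criteria=95):
    """Return indices of reads that start by start_cutoff and end at or past end_cutoff."""
    s_cut, e_cut = percentile_cutoffs(starts, ends, criteria)
    return [i for i, (s, e) in enumerate(zip(starts, ends))
            if s <= s_cut and e >= e_cut]


# Toy data: 19 reads in one region plus one read aligned elsewhere.
starts = [5256] * 19 + [42993]
ends = [13862] * 19 + [43116]

print(percentile_cutoffs(starts, ends))  # (5256, 13862)
print(len(screen(starts, ends)))         # 19
```

Screening on both ends is what removes the stray read here; a cutoff on start alone combined with a fixed end can keep a different, non-overlapping subset.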

Hi!
I have a similar problem when analyzing my data.
After the unique.seqs step I have:

Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 0 0 0 0 1 2861
25%-tile: 1044 1044 1 0 1 28602
Median: 1044 1051 3 0 1 57203
75%-tile: 43096 43116 11 0 2 85804
97.5%-tile: 43115 43116 23 0 4 111544
Maximum: 43116 43116 50 0 6 114404
Mean: 16044.1 16057.4 5.44179 0 1.86354

# of unique seqs: 20371

total # of seqs: 114404

Output File Names:
/media/jmarcelino/Data/START/analise2/2014/p2014.shhh.trim.unique.summary

It took 34 secs to summarize 114404 sequences.

Then I ran screen.seqs():

screen.seqs(fasta=p2014.shhh.trim.unique.align, name=p2014.shhh.trim.unique.names, group=p2014.shhh.groups, optimize=start-end, criteria=95, processors=4)

and I get:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1044 1 0 1 1
2.5%-tile: 1044 1044 1 0 1 468
25%-tile: 1044 1051 2 0 1 4672
Median: 1044 1067 4 0 2 9343
75%-tile: 43107 43116 11 0 3 14014
97.5%-tile: 43115 43116 19 0 4 18217
Maximum: 43115 43116 50 0 6 18684
Mean: 21118.8 21132.4 5.9635 0 1.96146

# of Seqs: 18684

Output File Names:
/media/jmarcelino/Data/START/analise2/2014/p2014.shhh.trim.unique.good.summary

It took 15 secs to summarize 18684 sequences.

After filter.seqs():

filter.seqs(fasta=p2014.shhh.trim.unique.good.align, vertical=T, trump=., processors=2)

I get:

Length of filtered alignment: 0
Number of columns removed: 50000
Length of the original alignment: 50000
Number of sequences used to construct filter: 18684

Can you help me to solve this problem, please?

I think you are having a problem with your alignment. Are you sure that your sequences are oriented in the correct direction? You might try doing flip=T in align.seqs and try again.
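
For anyone unsure what "oriented in the correct direction" means: a read sequenced from the opposite strand only matches the reference after it is reverse complemented. As I understand it, flip=T lets align.seqs also try that orientation when a read aligns poorly. A minimal Python sketch of the operation itself (the helper name and the example read are made up):

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGGTA"))  # TACCGT
```

A median NBases of 3, as in the summary above, is consistent with reads that barely align in their current orientation.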

Pat

Hi Pat,

Of course you were right, as usual… For some reason I didn’t write flip=T, although I thought I did…
Anyway, thanks A LOT!

Best regards,
JMarcelino