filter.seqs: length of filtered alignment problem

Hello all,

I am working with MiSeq 16S bacterial data from the V3-V4 region.
I have a problem with the filter.seqs command: after running it, I get the following result for the fasta file, and I don't understand why:

mothur > filter.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.good.align, vertical=T, trump=., processors=20)

Using 20 processors.
Creating Filter…


Running Filter...

Length of filtered alignment: 65
Number of columns removed: 49935
Length of the original alignment: 50000
Number of sequences used to construct filter: 123

Output File Names:
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.filter
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.good.filter.fasta

Here are the earlier commands:
#align.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.fasta, reference=/home/kdiallo/data_OHP/result_files/pipeline4/silva.bacteria.v34.fasta, flip=t, processors=20)
#summary.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, processors=20)

mothur > summary.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, processors=20)

Using 20 processors.

             Start     End       NBases   Ambigs     Polymer  NumSeqs
Minimum:     -1        -1        0        0          1        1
2.5%-tile:   1044      1046      2        0          1        279193
25%-tile:    6387      6443      7        0          2        2791929
Median:      6428      23440     14       0          3        5583858
75%-tile:    22581     23440     407      0          4        8375786
97.5%-tile:  23488     23490     500      2          49       10888522
Maximum:     23488     23490     500      2          49       11167714
Mean:        8929.87   12260.4   79.6921  0.0131782  2.33932

# of unique seqs: 5424889
total # of seqs: 11167714

#screen.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, start=23488, end=23490, processors=20)

filter.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.good.align, vertical=T, trump=., processors=20)

Thanks for your help.
Karim

First, if you are going to use V3-V4 you probably want to see this:

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Second, the short filtered alignment is caused by how you set the parameters in screen.seqs. You indicated that you did this:

screen.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, start=23488, end=23490, processors=20)

You probably really want this:

screen.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, start=6428, end=23488, processors=20, maxambig=0, maxhomop=8)
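To illustrate why the two settings behave so differently, here is a minimal Python sketch of the start/end test. It assumes that screen.seqs keeps a sequence only when it starts at or before `start` and ends at or after `end`. The function name `passes_screen` and the read coordinates are hypothetical, chosen to resemble the summary.seqs output above; this is not mothur's own code.

```python
def passes_screen(seq_start, seq_end, start, end):
    """Return True if an aligned sequence would survive the start/end test.

    Assumption: a sequence is kept when it starts at or before `start`
    and ends at or after `end` (alignment coordinates).
    """
    return seq_start <= start and seq_end >= end


# Hypothetical (start, end) coordinates resembling the summary.seqs output.
reads = [(6428, 23490), (6387, 23488), (22581, 23490), (23488, 23490), (1044, 1046)]

# start=23488, end=23490: nearly any start position is accepted, so reads
# covering very different regions pass as long as they reach column 23490.
loose = [r for r in reads if passes_screen(*r, start=23488, end=23490)]

# start=6428, end=23488: only reads spanning the whole window pass.
strict = [r for r in reads if passes_screen(*r, start=6428, end=23488)]

print(loose)   # reads that share only the last few columns
print(strict)  # reads that all cover columns 6428 to 23488
```

Under this assumption, the first setting keeps reads that start anywhere up to column 23488, so the surviving sequences may have very little alignment space in common.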

Thanks Pat for your reply.

On the first point, about the V3-V4 region: I read the page, and I fully agree with you about the importance of data quality and the problems with this region. However, the data were generated earlier by my internship supervisor, so I am trying to process them as well as I can.
On the second point: I reran screen.seqs with the parameters start=23488, end=23490, but I still get the same result after filter.seqs.

Here are the commands and results:

mothur > screen.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, start=23488, end=23490, processors=20)

Using 20 processors.

Removing group: T0AInitial2501 because all sequences have been removed.

Removing group: T0CInitial2507 because all sequences have been removed.

Removing group: T0CInitial2508 because all sequences have been removed.

Removing group: T0PInitial2509 because all sequences have been removed.

Removing group: T0PInitial2511 because all sequences have been removed.

Removing group: T0PInitial2512 because all sequences have been removed.

Removing group: T0QInitial2514 because all sequences have been removed.

Removing group: T0QInitial2515 because all sequences have been removed.

Output File Names:
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.good.align
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.bad.accnos
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.good.count_table


It took 1323 secs to screen 5424889 sequences.

mothur > quit()

And filter.seqs:
mothur > filter.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.good.align, vertical=T, trump=., processors=20)

Using 20 processors.
Creating Filter…


Running Filter...

Length of filtered alignment: 65
Number of columns removed: 49935
Length of the original alignment: 50000
Number of sequences used to construct filter: 123

Output File Names:
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.filter
/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.good.filter.fasta


mothur > quit()

I tested screen.seqs with other parameters, but without success. Any ideas or other suggestions?

What happens if you do this?

screen.seqs(fasta=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.unique.align, count=/home/kdiallo/data_OHP/result_files/pipeline4/stability_files.trim.contigs.good.count_table, start=6428, end=23488, processors=20, maxambig=0, maxhomop=8)
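The reason the screening coordinates matter for filter.seqs can be shown with a toy example. The Python sketch below assumes that vertical=T drops columns in which every sequence has a gap character ('-' or '.'), and that trump=. drops any column in which at least one sequence has a '.'. The function `filter_columns` and both small alignments are hypothetical and only illustrate the idea.

```python
def filter_columns(aligned, vertical=True, trump="."):
    """Return the indices of alignment columns kept by a toy filter.

    Assumptions: `vertical` drops columns in which every sequence has a gap
    ('-' or '.'), and `trump` drops any column in which at least one
    sequence has the trump character.
    """
    kept = []
    for col in range(len(aligned[0])):
        chars = [seq[col] for seq in aligned]
        if trump is not None and trump in chars:
            continue  # a single '.' is enough to remove the column
        if vertical and all(c in "-." for c in chars):
            continue  # the column holds no bases at all
        kept.append(col)
    return kept


# Hypothetical 12-column alignments; '.' pads the ends of each read.
poor_overlap = [
    "ACGTACGTAC..",
    "......GTACGT",
]
good_overlap = [
    ".CGTACGTAC-.",
    ".CGTAC-TACG.",
]

print(len(filter_columns(poor_overlap)))  # 4
print(len(filter_columns(good_overlap)))  # 10
```

In this sketch, reads that start and end at different positions leave only the few columns they all share, which is consistent with a filtered alignment as short as 65 columns.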

Hi @pschloss, is there a specific reason to choose 6428 as the start and 23488 as the end?
I'm facing the same problem of excessive filtering of sequences with my data.

Thanks.