Remaining gaps in the fasta file

Hi everyone,
After aligning to the V4 region following the steps of the SOP, I ran filter.seqs with vertical=T and trump=. and then unique.seqs once again.
However, the resulting fasta file still contains gaps (-), as shown in the example below. This is probably affecting my analysis and the use of the final fasta in picrust, since I received an error message about a poor alignment in the fasta file (resulting from the many gaps).
How can I solve this?

(example of file “luan.trim.contigs.good.unique.good.filter.unique.fasta”)

>UA231200064738283
T--AC--GG-AG-GGT----GCA-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GC-AC-------G-CA-G-G-C-G---G--T-CT-G-T-T---AA--G-T-C-A-------G-A-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-T-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--T---------T--GA-A-A---C------T-G-G--CA--G-G-C-----------T-A-G-A-G-T--C----T-TG----TA-G-A----GTG-G-G---G--T---------AG--A--ATT--------C-C-A-G-GT--GT-A-G-CG-GT--G-A-A-A--TG-C-GT-AG--AG-A-TC-T-G-G-A----G-G-A-AT-A-CC----GG--T--G--GC-GAA-G--G-C--G--G--C-C-C-C--CTT---G--AC-A-A----------------------A-G---A-C-T--GA--CG--C--T-C--A-GG--T-G-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG
>UA2312000645130002
T--AC--GT-AG-GGT----GCG-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GT-GC-------G-CA-G-G-C-G---G--T-TG-T-G-T---AA--G-A-C-A-------G-G-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-C-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--T---------C--GA-A-A---C------T-G-G--CA--G-G-C-----------T-A-G-A-G-T--C----T-TG----TA-G-A----G-G-G-G---GG-T---------AG--A--ATT--------C-C-C-A-GT--GT-A-G-CG-GT--G-A-A-A--TG-C-GT-AG--AG-A-TT-G-G-G-A----A-G-A-AC-A-TC----GG--T--G--GC-GAAAG--C-G--T--G--C-T-A-C---TG---G--GC-T-G----------------------T-A---T-C-T--GA--CA--C--T-C--A-GG--G-A-CG-A--AA-G-C---T-AG--GG-G--AG-C-G-AA-AG--GG
>UA231200064759023
T--AC--GT-AG-GGT----GCA-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GT-GC-------GTCA-G-G-C-G---G--T-AA-T-G-T---AA--G-A-C-A-------G-T-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-C-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--C---------T--GT-G-A---C------T-G-C--AT--T-G-C-----------T-G-G-A-G-T--A----C-GG----CA-G-A----G-G-G-G---GA-T---------GG--A--ATT--------C-C-G-C-GT--GT-A-G-CA-GT--G-A-A-A--TG-C-GT-AG--AT-A-TG-C-G-G-A----G-G-A-AC-A-CC----GA--T--G--GC-GAA-G--G-C--A--A--T-C-C-C--CTG---G--GC-C-T----------------------G-T---A-C-T--GA--CG--C--T-C--A-TG--C-A-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG
>UA231200064891860
T--AC--GG-AG-GGT----GCA-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GC-AC-------G-CA-G-G-C-G---G--T-CT-G-T-C---AA--G-T-C-G-------G-A-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-C-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--T---------C--GA-A-A---C------T-G-G--CA--G-G-C-----------T-T-G-A-G-T--C----T-TG----TA-G-A----G-G-G-G---GG-T---------AG--A--ATT--------C-C-A-G-GT--GT-A-G-CG-GT--G-A-A-A--TG-C-GT-AG--AG-A-TC-T-G-G-A----G-G-A-AT-A-CC----GG--T--G--GC-GAA-G--G-C--G--G--C-C-C-C--CTG---G--AC-A-A----------------------A-G---A-C-T--GA--CG--C--T-C--A-GG--T-G-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG
>UA2312000644152077
G--AC--GG-GG-GGG----GCA-A-G--T-G-T-T--CT-T-CGG-AA--TG-A--C-T--GG-GC--GT--A---AA-GG-GC-AC-------G-TA-G-G-C-G---G--T-GA-A-T-C---GG--G-T-T-G-------A-A-A--G--TG--A-AA-G---T-C-GC-C-A--------AA-A-AG--------T-G-G-C-G-G-A--A------T-G--C-T---C--T---------C--GA-A-A---C------C-A-A--TT--C-A-C-----------T-T-G-A-G-T--G----G-GA----CA-G-G----G-G-A-G---AG-T---------GG--A--ATT--------T-C-G-T-GT--GT-A-G-GG-GT--G-A-A-A--TC-C-AG-AA--AT-C-TA-C-G-A-A----G-G-A-AC-G-CC----AA--A--A--GC-GAA-G--G-C--A--G--C-T-C-T--CTG---G--GT-C-C----------------------C-T---A-C-C--GA--CG--C--T-G--A-GG--T-G-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG
>UA2312000644142327
T--AC--GG-AG-GGT----GCA-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GC-AC-------G-CA-G-G-C-G---G--T-CT-G-T-C---AA--G-T-C-G-------G-A-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-C-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--T---------C--GA-A-A---C------T-G-G--CA--G-G-C-----------T-A-G-A-G-T--C----T-TG----TA-G-A----G-G-G-G---GG-T---------AG--A--ATT--------C-C-A-G-GT--GT-A-G-CA-GT--G-A-A-A--TG-C-GT-AG--AG-A-TC-T-G-G-A----G-G-A-AT-A-CC----GG--T--G--GC-GAA-G--G-C--G--G--C-C-C-C--CTG---G--AC-A-A----------------------A-G---A-C-T--GA--CG--C--T-C--A-TG--C-A-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG
>UA231200064744279
T--AC--GG-AG-GGT----GCA-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GC-AC-------G-CA-G-G-C-G---G--T-CT-G-T-C---AA--G-T-C-G-------G-A-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-C-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--T---------C--GA-A-A---C------T-G-G--CA--G-G-C-----------T-A-G-A-G-T--C----T-TG----TA-G-A----G-G-G-G---GG-T---------AG--A--ATT--------C-C-A-G-GT--GT-A-G-CG-GT--G-A-A-A--TG-C-GT-AG--AG-A-TC-T-G-G-A----G-G-A-AT-A-CC----GG--T--G--GC-GAA-G--G-C--G--G--C-C-C-C--CTG---G--AC-A-A----------------------A-G---A-C-T--GA--CG--C--T-C--A-GG--T-G-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG
>UA231200064353574
T--AC--GG-AG-GGT----GCA-A-G--C-G-T-T--AA-T-CGG-AA--TT-A--C-T--GG-GC--GT--A---AA-GC-GC-AC-------G-CA-G-G-C-G---G--T-CT-G-T-T---AA--G-T-C-A-------G-A-T--G--TG--A-AA-TC--C-C-CG-G-G--------CT-T-AA--------C-C-T-G-G-G-A--A-C----T-G--C-A---T--T---------T--GA-A-A---C------T-G-G--CA--G-G-C-----------T-T-G-A-G-T--C----T-CG----TA-G-A----G-G-G-G---GG-T---------AG--A--ATT--------C-C-A-G-GT--GT-A-G-CG-GT--G-A-A-A--TG-C-GT-AG--AG-A-TC-T-G-G-A----G-G-A-AT-A-CC----GG--T--G--GC-GAA-G--G-C--G--G--C-C-C-C--CTG---G--AC-G-A----------------------A-G---A-C-T--GA--CG--C--T-C--A-GG--T-G-CG-A--AA-G-C---G-TG--GG-G--AG-C-A-AA-CA--GG

If you want to remove the gaps, you need to use degap.seqs. The gaps will remain after running filter.seqs because vertical=T only removes columns where every sequence has a gap in that position. The gaps you have in your output are not found in every sequence.
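
To make the difference concrete, here is a minimal Python sketch. It is not mothur's code; the function names and the toy sequences are made up for illustration, and it assumes "-" and "." are the only gap characters. It removes all-gap columns (the idea behind vertical=T) and then strips every gap (the idea behind degap.seqs).

```python
GAP_CHARS = {"-", "."}  # assumption: these are the only gap symbols


def filter_vertical(seqs):
    """Drop alignment columns in which every sequence has a gap."""
    n_cols = len(next(iter(seqs.values())))
    keep = [
        i for i in range(n_cols)
        if any(s[i] not in GAP_CHARS for s in seqs.values())
    ]
    return {name: "".join(s[i] for i in keep) for name, s in seqs.items()}


def degap(seqs):
    """Remove every gap character, giving unaligned sequences."""
    return {
        name: "".join(c for c in s if c not in GAP_CHARS)
        for name, s in seqs.items()
    }


# Toy alignment: columns 2 and 3 are gaps in every sequence,
# the other gap columns are gaps in only some sequences.
aligned = {
    "seq1": "T--AC-GG-A",
    "seq2": "T--ACAGG-A",
    "seq3": "T--AC-GGTA",
}

filtered = filter_vertical(aligned)  # shared gap columns are gone
unaligned = degap(filtered)          # no gaps left at all

for name in aligned:
    print(name, filtered[name], unaligned[name])
```

In the toy data, seq1 still has dashes after the vertical filter because seq2 and seq3 each have a base in one of those columns. That is the same situation as in the file posted above.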

Pat

Hi Pat, thanks for that.
My concern is that the many gaps (dashes) in the fasta produced by filter.seqs mean I got a very poor alignment when following the SOP steps for the V4 region against SILVA 138.
When I applied trump=- in filter.seqs, nearly 99% of the columns were removed.
How can I solve this? By running degap.seqs?
Or should I align against another database, such as RDP?

Thanks,

Also, after finishing the analyses, I ran picrust2 and it returned an error message because of the poor alignment (represented by these gaps). That is my other concern. I don't know what to do …
Now I'm running pre.cluster after degap.seqs and the command does not finish. It takes a very long time and eventually crashes …

Curiously, after alignment the sequences seem to overlap acceptably.
Please see below:
mothur > summary.seqs(fasta=current, count=luan.trim.contigs.good.count_table)
Using luan.trim.contigs.good.unique.align as input file for the fasta parameter.

Using 64 processors.

             Start    End      NBases  Ambigs  Polymer  NumSeqs
Minimum:     1        9438     3       0       1        1
2.5%-tile:   1        11550    253     0       4        27791
25%-tile:    1968     11550    253     0       5        277910
Median:      1968     11550    253     0       5        555819
75%-tile:    1968     11550    253     0       6        833728
97.5%-tile:  1968     11550    272     0       6        1083846
Maximum:     13406    13424    275     0       8        1111636
Mean:        1866     11563    254     0       5

# of unique seqs: 156828
total # of seqs: 1111636

It took 6 secs to summarize 1111636 sequences.

Whatever you do, don’t use trump=-. Please follow the MiSeq SOP. You’ll need to run screen.seqs to remove the sequences that end early and start late and that are too short. I also wonder if you still have primers on some sequences that are >270 nt long.
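
As a rough illustration of that screening idea, here is a small Python sketch. It is not mothur's implementation; the function names, the toy sequences and the thresholds are hypothetical. It assumes Start and End are the 1-based alignment columns of the first and last base, as in the summary.seqs table above, and keeps a sequence only if it starts at or before the chosen start column, ends at or after the chosen end column, and has a length inside the chosen bounds.

```python
GAP_CHARS = "-."  # assumption: these are the only gap symbols


def alignment_span(aligned_seq):
    """Return (start, end, nbases) using 1-based alignment columns."""
    cols = [i for i, c in enumerate(aligned_seq, start=1) if c not in GAP_CHARS]
    if not cols:
        return (0, 0, 0)
    return (cols[0], cols[-1], len(cols))


def passes_screen(aligned_seq, start, end, min_length, max_length):
    """Keep sequences that cover [start, end] and have an allowed length."""
    s, e, n = alignment_span(aligned_seq)
    return n > 0 and s <= start and e >= end and min_length <= n <= max_length


# Toy 12-column alignment; real data would use values such as
# start=1968 and end=11550 taken from the summary.seqs output.
toy = {
    "ok":          "--ACG-TACG--",
    "starts_late": "----G-TACG--",
    "ends_early":  "--ACG-TA----",
    "too_long":    "--ACGTTACG--",
}

kept = [
    name for name, seq in toy.items()
    if passes_screen(seq, start=3, end=10, min_length=6, max_length=7)
]
print(kept)
```

Only the first toy sequence survives here. The others are dropped because they start late, end early, or exceed the length cap.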

If you need to remove gaps for another analysis, then use degap.seqs

Pat

Hi Pat,
Yes, definitely. I had already run screen.seqs with start=1968 and end=11550.
Even after the filter, unique, pre-cluster and chimera steps, there are still many dashes in the fasta file, and this led to an error in picrust2 (picrust2 treated it as a poor alignment and crashed).

So I ran degap.seqs on the final fasta file to remove the dashes before running picrust2, using only the representative OTU/ASV sequences.
With that, it worked. I'm not sure whether this causes any problems further along in the picrust2 algorithms.

Thank you!