align.seqs removes sequences from fasta file

Hi,

I am not sure if this has been asked before, but when I run align.seqs, many sequences are dropped from the fasta file, which creates a discrepancy between the fasta file and the name/group files. For instance, if I run uchime downstream of this, the error
“[ERROR]: sequenceA is in your name file and not in your fasta file, please correct.”
comes up multiple times.
NB: these are not amplicons; I just picked out 16S rRNA gene sequences and would like to classify them.

How can we avoid this, or is there a way around it?

I have posted the output below:

mothur > summary.seqs(fasta=current, name=current)
Using 16S_pandaseq.unique.fasta as input file for the fasta parameter.
Using 16S_pandaseq.names as input file for the name parameter.

Using 1 processors.

            Start  End      NBases   Ambigs       Polymer  NumSeqs
Minimum:    1      47       47       0            1        1
2.5%-tile:  1      88       88       0            3        707378
25%-tile:   1      111      111      0            4        7073777
Median:     1      138      138      0            4        14147554
75%-tile:   1      155      155      0            5        21221331
97.5%-tile: 1      168      168      0            6        27587730
Maximum:    1      170      170      3            86       28295107
Mean:       1      133.095  133.095  0.000118854  4.14909

# of unique seqs: 10402292

total # of seqs: 28295107

Output File Names:
16S_pandaseq.unique.summary

It took 253 secs to summarize 28295107 sequences.

mothur > align.seqs(fasta=current, reference=/mothurfiles/silva.nr_v123.align, flip=T, processors=8)
Using 16S_pandaseq.unique.fasta as input file for the fasta parameter.

Using 8 processors.

Reading in the /mothurfiles/silva.nr_v123.align template sequences… DONE.
It took 264 to read 172418 sequences.
Aligning sequences from 16S_pandaseq.unique.fasta …
Some of you sequences generated alignments that eliminated too many bases, a list is provided in 16S_pandaseq.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 45468 secs to align 10402292 sequences.


Output File Names:
16S_pandaseq.unique.align
16S_pandaseq.unique.align.report
16S_pandaseq.unique.flip.accnos

mothur > summary.seqs(fasta=current, name=current)
Using 16S_pandaseq.unique.align as input file for the fasta parameter.
Using 16S_pandaseq.names as input file for the name parameter.

Using 8 processors.

            Start    End      NBases   Ambigs       Polymer  NumSeqs
Minimum:    -1       -1       0        0            1        1
2.5%-tile:  1046     1460     10       0            2        476999
25%-tile:   9815     14974    100      0            3        4769984
Median:     21926    26171    127      0            4        9539968
75%-tile:   34129    38347    152      0            5        14309952
97.5%-tile: 43021    43116    168      0            6        18602937
Maximum:    43116    43116    170      2            78       19079935
Mean:       21458.5  25246.9  121.296  8.24426e-05  4.06886

# of unique seqs: 5161509

total # of seqs: 19079935

Output File Names:
16S_pandaseq.unique.summary

It took 3430 secs to summarize 19079935 sequences.

Thanks,

Sou

Hi

Which version number of mothur are you running? What type of computer is it running on? Please update to the most recent version.

Do you see any error messages when you run align.seqs?

Pat

Hi Pat,

I am running mothur v.1.35.1 on Linux. It might take a while for me to update to the most recent version, since this is on a server on which I have no admin rights.

No error message is produced at this stage. However, anything I run downstream that requires the group/name files fails, because the .align file is missing some sequences.
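To see exactly which sequences went missing, the representative names in the names file can be compared against the headers in the aligned fasta. Below is a minimal Python sketch of that check. It assumes the usual two-column names layout (representative name, then a comma-separated list of the reads it stands for) and that the sequence ID is the first whitespace-delimited token of each fasta header. The function names are hypothetical helpers, not part of mothur.

```python
def fasta_ids(lines):
    """Collect sequence IDs from fasta header lines (first token after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def names_representatives(lines):
    """Map each representative name to the number of reads it represents.

    Assumes two whitespace-separated columns: representative, then a
    comma-separated member list.
    """
    reps = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rep, members = fields[0], fields[1]
        reps[rep] = len(members.split(","))
    return reps


def missing_from_fasta(fasta_lines, names_lines):
    """Return representatives listed in the names file but absent from the fasta."""
    present = fasta_ids(fasta_lines)
    reps = names_representatives(names_lines)
    return {rep: count for rep, count in reps.items() if rep not in present}


if __name__ == "__main__":
    # Tiny made-up example; real use would pass open file handles instead.
    fasta = [">seqA", "AC-GT", ">seqC\textra", "A--GT"]
    names = ["seqA\tseqA,seqA2", "seqB\tseqB", "seqC\tseqC,seqC2,seqC3"]
    for rep, count in missing_from_fasta(fasta, names).items():
        print(f"{rep} (represents {count} reads) is in the names file but not in the fasta")
```

With real data, the two lists would be replaced by open file handles, for example `open("16S_pandaseq.unique.align")` and `open("16S_pandaseq.names")`, since both functions only iterate over lines.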

Thanks,

Sou

Hi,

You should be able to install mothur in your project directory without needing system-wide permissions. Can you also try to use 1 or 2 processors when running align.seqs?

Pat

Hi Pat,

Sorry for the delay in the reply.

I downloaded the v1.38.0 binary and tried again, but it made no difference.

Before align.seqs I had:

# of unique seqs: 10402292

total # of seqs: 28295107

and after running align.seqs with 8 processors:

# of unique seqs: 4883039

total # of seqs: 18459696

With 1 processor:

# of unique seqs: 5644825

total # of seqs: 22727956

The command I ran was align.seqs(fasta=current, reference=/mothurfiles/silva.nr_v123.align, flip=T, processors=1)
where current is the correct fasta file (I checked by running summary.seqs before and after the alignment).
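
For reference, the size of the loss can be worked out from the summary.seqs counts in this post. Below is a small Python sketch of that arithmetic; the function and variable names are only illustrative.

```python
def fraction_lost(before, after):
    """Share of sequences present before align.seqs but absent afterwards."""
    return (before - after) / before


# Counts copied from the summary.seqs output in this post.
counts_before = {"unique": 10402292, "total": 28295107}
runs = {
    "processors=8": {"unique": 4883039, "total": 18459696},
    "processors=1": {"unique": 5644825, "total": 22727956},
}

loss = {
    label: {kind: fraction_lost(counts_before[kind], after[kind])
            for kind in counts_before}
    for label, after in runs.items()
}

for label, by_kind in loss.items():
    for kind, frac in by_kind.items():
        print(f"{label}: {frac:.1%} of {kind} sequences missing after align.seqs")
```

Roughly 53% of the unique sequences are missing with 8 processors and roughly 46% with 1 processor, so the loss changes with the processor count but does not go away.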

Thanks!

Could you send your log file and input files to mothur.bugs@gmail.com so I can track down the issue for you? Please reference this post.

Hi,

The input fasta file is almost 2 GB. Can a file that size be sent by email?

Thanks,

Sou