Errors following align.seqs


I am new to metagenomics, and I am having a similar issue as that described in this past post. I am using mothur v. 1.48.0.

I am following the SOP, and I have run into no issues with creating contigs, screening out the sequences that have the incorrect length/ambiguities, and isolating the unique sequences.

I am trying to align my sequences to the SILVA 138.1 reference using:
mothur > align.seqs(fasta=shrub.trim.contigs.good.unique.fasta, reference=silva.nr_v138_1.pcr.fasta, processors=1)

, where silva.nr_v138_1.pcr.fasta is my reference file following the pcr.seqs command as described in the SOP (using mothur > pcr.seqs(fasta=silva.nr_v138_1, start=11895, end=25318, keepdots=F)).

After running this command and trying to see the summary of the aligned output, I keep receiving errors saying that the fasta file contains fewer reads than the original count table (about 600,000 fewer reads). This occurred initially when I was defaulting to 8 processors, and then it kept occurring after I changed it to 1 or 2 processors as was recommended in past posts.

Here is the code and output that I’m using/receiving:

mothur > align.seqs(fasta=shrub.trim.contigs.good.unique.fasta, reference=silva.nr_v138_1.pcr.fasta, processors=1)

“It took 3930 secs to align 952556 sequences.
[WARNING]: 81 of your sequences generated alignments that eliminated too many bases, a list is provided in /Volumes/NO NAME/RMultiflora/shrub.trim.contigs.good.unique.flip.accnos.
[NOTE]: 35 of your sequences were reversed to produce a better alignment.
It took 3930 seconds to align 952556 sequences.”


mothur > summary.seqs(fasta=shrub.trim.contigs.good.unique.align, count=shrub.trim.contigs.good.count_table, processors=8)

“[ERROR]: Your count file contains 952556 unique sequences, but your fasta file contains 318861. File mismatch detected, quitting command.”

(Just more information in case it’s relevant for diagnostics/troubleshooting.)
I noticed that my shrub.trim.contigs.good.unique.align file is 4.29GB, which is the same size as the original silva.nr_v138_1 file (4.28GB).
When I open the shrub.trim.contigs.good.unique.align_report file, I see a table with lots of rows of entries with the following columns: QueryName, QueryLength, TemplateName, TemplateLength… This file looks promising, but it may be missing entries if the error code is correct.

Any advice on what I should do? Let me know if there is any more information that I could provide to help.


A little more investigation into what I think might be happening.

After reading the README for the SILVA v138.1, I decided to re-run the align.seqs command with the silva.nr_v138_1.align file that I directly downloaded from your forum.

This time, the command aborted because the SILVA reference file was not aligned.

Here is what I saw:

mothur > align.seqs(fasta=shrub.trim.contigs.good.unique.fasta, reference=silva.nr_v138_1.align, processors=1)

Using 1 processors.

Reading in the /Volumes/NO NAME/RMultiflora/silva.nr_v138_1.align template sequences... [ERROR]: template is not aligned, aborting.


It took 135 to read 85702 sequences.

It reminded me that when I was running the align.seqs with my silva.nr_v138_1.pcr.fasta file that it was only using about 85,000 sequences instead of the 106,985 mentioned on the README. Could the misalignment of the SILVA file be removing sequences from the alignment procedure? If so, how do I remedy that? I noticed on the SILVA download page that it says that the reference file could be customized for alignments. Does this mean that it is not aligned as is?

Can you post the table that is outputted when running summary.seqs before you ran align.seqs? Also, can you post the output of running summary.seqs after you ran align.seqs but leave out the count file?


BEFORE align.seqs:
summary.seqs(fasta=shrub.trim.contigs.good.unique.fasta, count=shrub.trim.contigs.good.count_table)

AFTER align.seqs without count file:

Also received the following message:
[WARNING]: We found more than 25% of the bases in sequence M01676_4_000000000-LD84T_1_2106_18853_14047 to be ambiguous. Mothur is not setup to process protein sequences.

These seem consistent with the other errors. Is the count table not being rewritten/updated for some reason? Also, why does the align.seqs remove so many of my unique sequences?

Another thing that I noticed when looking back at what I had already done was that when I was trimming the SILVA reference file using the pcr.seqs command, I did not receive the same output message that I got when following the SOP with the older version of the SILVA reference.

What I got from the tutorial:

What I got for my own project:

(There was no “Output File Names” line after this, but there was an output file added to my directory. The next line in the Terminal was a mothur > command line.)

Can you try running align.seqs using the reference alignment that I used in the MiSeq SOP? I think something screwy is going on there since it appears that there is one very long sequence (or something else that’s weird)


align.seqs(fasta=shrub.trim.contigs.good.unique.fasta, reference=silva.bacteria.pcr.fasta, processors=2)

It took 1060 secs to align 952556 sequences.
[WARNING]: 182 of your sequences generated alignments that eliminated too many bases, a list is provided in /Volumes/NO NAME/RMultiflora/shrub.trim.contigs.good.unique.flip.accnos.
[NOTE]: 102 of your sequences were reversed to produce a better alignment.


[WARNING]: We found more than 25% of the bases in sequence M01676_4_000000000-LD84T_1_1101_5443_13119 to be ambiguous. Mothur is not setup to process protein sequences.

It looks like the same thing is happening with the removal of sequences, even with the same reference that I used throughout the SOP.

I figured it out!

When I had downloaded the SILVA v138.1 reference file from your forum, I then moved it to the directory I was working in (for my case, an external flashdrive). For some reason, when I went to unzip the files, the reference file was decompressed incorrectly such that it was only 4.6GB. I think that when I was aligning my sequences, a lot of the reference sequences were either missing or corrupted, which led to the cascade of errors that I encountered (the error message about the reference file not being aligned, the ambiguous sequence stuff, the removal of sequences from my data, etc.).

I decided to move my working directory to a Dropbox folder with more storage, and when I repeated the process to unzip the SILVA files, I ended up with one that was 7GB. Running summary.seqs on it also yielded the correct number of sequences. I went through the SOP again, and I made it past the alignment step, and everything is working as it should! The summary tables show the 900,000 unique sequences and the 2,500,000 total sequences. I don’t know why I encountered the same problems when I used the other reference file from the SOP. Maybe that has something to do with me switching to the Dropbox as well.

Thank you for being responsive to my concern! It was reassuring to know that you were there to look into things.

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.