Classify.seqs Generating search database... takes forever

Hi,
I used the reference file SILVA_132_LSURef_tax_silva_trunc.fasta (593 318 kb) and the taxonomy file taxmap_slv_lsu_ref_132.txt (27 175 kb) to classify my sequences (which were processed successfully in mothur up to this step), but the run hangs indefinitely at

Using 8 processors.
Generating search database…

What could be the reason? Are the files too big? I tried to use get.lineage to pick the taxa of interest (Eukaryota-Bacteria;Cyanobacteria), but the pick only succeeded on the taxonomy file, not on the fasta file. I searched through the fasta file and these taxa are present in it. Deleting the “>” symbol in the fasta file did not help either.

What platform (OS) are you using? I had that problem once on Red Hat, where I was just calling the same process several times instead of distributing it…

Deleting the “>”? Why?

I tried 1.41.3 and 1.42.3, neither worked.

One earlier problem I came across was something like:

ACD298849847 found in the template and is not found in the taxonomy file

An answer on the mothur forum suggested removing the “>”.

After I removed the “>”, I ran into the problem I posted here.

I forgot to mention that I’m using Windows 10.

Can you try to upgrade to the most recent version of mothur? Those versions are quite old at this point. A few questions…

  • How many lines are in SILVA_132_LSURef_tax_silva_trunc.fasta? (In Windows I think you can get this by running find /c /v "" SILVA_132_LSURef_tax_silva_trunc.fasta from the command line.)
  • How many lines are in taxmap_slv_lsu_ref_132.txt?
  • Can you post the first few lines of both files?

Pat
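
For readers who would rather not rely on find, the two checks asked for above (a line count and the first few lines of each file) can be sketched in Python. This is only a minimal sketch and not part of mothur; the function names count_lines and head are made up for illustration.

```python
from itertools import islice


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def head(path, n=4):
    """Return the first n lines of a text file, without line endings."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\r\n") for line in islice(handle, n)]
```

Calling count_lines and head on SILVA_132_LSURef_tax_silva_trunc.fasta and taxmap_slv_lsu_ref_132.txt would give the numbers and lines requested in the post above.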

Hi Pat,
There are 7484196 lines in the fasta file, and 198844 lines in the taxonomy file.
The first lines in both files:
AY224383.3948.6873 Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus cereus
GGUUAAGUUAGAAAGGGCGCACGGUGGAUGCCUUGACACUAGGAGUCGAUGAAGGACGGGACUAACGCCGAUAUGCUUCG
GGGAGCUGUAAGUAAGCUUUGAUCCGAAGAUUUCCGAAUGGGGAAACCCACCAUACGUAAUGGUAUGGUAUCCUUAUCUG
GAUUUCCGAAUGGGGAAACCCACCAUACGUAAUGGUAUGGUAUCCUUAUCUG

primaryAccession.start.stop path organism_name taxid
AY224379.2894.5819 Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus; Bacillus cereus 815
AY224380.2894.5819 Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus; Bacillus cereus 815
AY224381.2668.5593 Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus; Bacillus cereus 815

I have deleted the “>” in the fasta file. In the taxonomy file, the first column was originally three separate columns: primaryAccession (e.g. AY224379), start (e.g. 2894) and stop (e.g. 5819). I combined these columns because, before I did so, there was an error saying that AY224379.2894.5819 was in the template file but not in the taxonomy file.
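
An error of the form "found in the template and is not found in the taxonomy file" suggests that the sequence names in the two files do not match. A minimal Python sketch for listing such names is shown here. It assumes that a sequence name is the first whitespace-delimited token after “>” in a fasta header and the first token on each taxonomy line; that is an assumption for illustration, not something taken from the mothur source, and the function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect names from fasta header lines (lines starting with '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])  # assumed: name ends at first whitespace
    return ids


def taxonomy_ids(lines):
    """Collect the first token of every non-empty taxonomy line."""
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:
            ids.add(fields[0])
    return ids


def missing_from_taxonomy(fasta_lines, taxonomy_lines):
    """Names present in the fasta file but absent from the taxonomy file."""
    return sorted(fasta_ids(fasta_lines) - taxonomy_ids(taxonomy_lines))
```

Both functions accept any iterable of lines, so open file handles can be passed directly. A fasta file with the “>” removed yields no names at all here, since headers can no longer be told apart from sequence lines.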

A couple of things stand out as problems…

  • You must have the > in a fasta file. That’s part of what makes it a fasta file.
  • You need to get rid of the “primaryAccession.start.stop path organism_name taxid” line in your taxonomy file
  • The second column of your taxonomy file cannot have spaces in it (e.g. “; Bacillus cereus 815”)
  • The last character of the taxonomy file needs to be a “;”
  • What version of mothur are you using?

I would encourage you to follow the README for constructing the SILVA reference files to see how we generated the one we provide users so that you can adapt it for your data.

Pat
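
The taxonomy-file fixes listed above (drop the header line, no spaces in the second column, a trailing “;”) could be sketched roughly as follows in Python. This assumes a tab-separated file whose columns are the combined name, path, organism_name and taxid, as in the lines posted earlier; it keeps only the name and the path, which is a simplification chosen for illustration. The function name to_mothur_taxonomy is hypothetical, and the README remains the reference for how the provided files were actually built.

```python
def to_mothur_taxonomy(lines):
    """Yield 'name<TAB>taxonomy;' lines from taxmap-style rows."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the column-header row.
        if not line or line.startswith("primaryAccession"):
            continue
        fields = line.split("\t")  # assumed: tab-separated columns
        if len(fields) < 2:
            continue
        name, path = fields[0], fields[1]
        # No spaces allowed in the taxonomy string.
        path = path.strip().replace(" ", "_")
        # The taxonomy string must end with ';'.
        if not path.endswith(";"):
            path += ";"
        yield f"{name}\t{path}"
```

Writing the yielded lines to a new file, rather than overwriting taxmap_slv_lsu_ref_132.txt, keeps the original available for comparison.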

Hi Pat,
I did what you suggested. Now classify.seqs works. But get.lineage(fasta=SILVA_132_LSURef_tax_silva_trunc.fasta, taxonomy=taxmap_slv_lsu_ref_132.txt, taxon=Eukaryota-Bacteria;Cyanobacteria) still works only on the taxonomy file, not on the fasta file.

What version of mothur are you using?

I tried 1.41.3, 1.42.3 and 1.43

Can you try the latest version that was posted?

Hi Pat,
I tried the 1.44.1 version. Now everything works. Thanks for the help!

Yan
