Not sure if it's a bug, but definitely something has changed between mothur versions v.1.27 and v.1.31 for the worse (for me).
We created our own reference and taxonomy files for mothur based on SILVA Ref. 111 NR. At that time I was using mothur v.1.27 (for Windows 64bit) and our files worked fine with the classify.seqs and classify.otu commands. Now I wanted to upgrade to v.1.31 but found out that our reference files don't work anymore with the new version. I tested everything under the same conditions that I used with v.1.27: same computer, same FASTA sample files to classify, same reference & taxonomy files; the only difference was the version of mothur. I used both bacteria and eukaryote samples and reference files for this and got the same problem.
When using the SILVA 111 bacteria reference files with v.1.31, I either stopped the process after a few hours or it stopped by itself when it ran out of physical memory ([ERROR]: std::bad_alloc) - I saw in the Windows task manager that v.1.31 uses more physical memory for the same calculation than v.1.27. Generating the search database takes 3-4 min with v.1.31, and after that it is “Reading in the SSURef111bacteria.tax taxonomy…” forever. In v.1.27 the first step takes a split second and the whole analysis is done in around 5 min. I thought it might be the FASTA file size, but the silva.bacteria.fasta file from the mothur website is twice the size of our “home-made” one and works without problems.
I could of course “mix and match” the mothur versions, run the classify commands in v.1.27 and the rest of the analysis in v.1.31 but would prefer to stick with the newest version throughout if possible. Thanks in advance for any help you might be able to give!
Could you send your reference files and taxonomy files to email@example.com so I can try to track down the problem for you?
Thank you very much for your help. I sent you a link to my Dropbox with the files, as they are pretty big for an e-mail attachment.
I was able to run your files with our current version. Although the template files are small, they contain far more sequences than the silva reference files we use: ~300,000 vs. ~15,000. Calculating the template probabilities for that many sequences uses significantly more memory and time. We also added a few extra checks to make sure that the files match, which adds to the time needed to read the taxonomy file. We will remove some of those in the next release to help with the processing time for large reference files like yours. Did you use multiple processors when running the command? The ([ERROR]: std::bad_alloc) indicates you are running out of RAM. I have 4 GB of RAM and ran it successfully with 1 processor, but it used most of my memory.
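Since it is the number of sequences rather than the file size that matters here, it can help to count the records in a reference file before running classify.seqs. The following is only a rough Python sketch, not part of mothur; the function name `count_fasta_sequences` is made up for illustration, and it simply assumes every FASTA record starts with a line beginning with '>'.

```python
import io


def count_fasta_sequences(lines):
    """Count FASTA records by counting header lines that start with '>'.

    `lines` can be an open file or any iterable of text lines.
    (Hypothetical helper, not a mothur function.)
    """
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory demo; for a real file use:
#     with open("your_reference.fasta") as handle:
#         print(count_fasta_sequences(handle))
demo = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nACGT\n")
print(count_fasta_sequences(demo))
```

Running this on both the home-made reference and the silva.bacteria.fasta file from the mothur website should show the difference in sequence counts directly.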
Hi René, Hi Sarah,
I am new to mothur, so I might be missing something really basic here. I want to create my own reference taxonomy (and associated files) based on my own curated alignment. Could you tell me (or point me to a link) how to do it?
I am very thankful,
Here’s a link to the silva reference files on the mothur wiki, http://www.mothur.org/wiki/Silva_reference_files, in case you want to look at an example. To run the classify.seqs command, mothur expects a reference file and a taxonomy file. The reference file is in fasta format, http://www.mothur.org/wiki/Fasta_file, and the taxonomy file format is described at http://www.wiki.mothur.org/wiki/Taxonomy_File. Here are the common pitfalls to avoid when creating your own reference:
- The files must match, meaning if a sequence is in the reference file, mothur expects it in the taxonomy file and vice versa.
- The names of sequences in the taxonomy file don’t have the ‘>’ symbol in front of them like they do in the reference file.
- No spaces in the taxonomy. Lines like: seq1 Bacteria;Bacteroidetes 1;Bacteroidia; will fail because of the space after Bacteroidetes. You can run classify.seqs in debug mode to see what mothur is reading and help find errors like this.
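The three pitfalls above can be checked with a short script before handing the files to mothur. This is a minimal Python sketch, not mothur's own validation; the function names are hypothetical, and it assumes each taxonomy line is a sequence name, then whitespace (normally a tab), then the taxonomy string.

```python
def read_fasta_names(fasta_lines):
    """Return sequence names from FASTA headers (text after '>' up to the first whitespace)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def check_reference(fasta_lines, taxonomy_lines):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    fasta_names = set(read_fasta_names(fasta_lines))
    tax_names = set()
    for number, raw in enumerate(taxonomy_lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        taxonomy = parts[1] if len(parts) > 1 else ""
        if name.startswith(">"):
            problems.append(f"line {number}: name '{name}' starts with '>'")
            name = name[1:]  # compare without '>' to avoid a second, redundant report
        if not taxonomy:
            problems.append(f"line {number}: no taxonomy given for '{name}'")
        elif any(ch.isspace() for ch in taxonomy):
            problems.append(f"line {number}: whitespace inside taxonomy '{taxonomy}'")
        tax_names.add(name)
    for name in sorted(fasta_names - tax_names):
        problems.append(f"'{name}' is in the reference file but not in the taxonomy file")
    for name in sorted(tax_names - fasta_names):
        problems.append(f"'{name}' is in the taxonomy file but not in the reference file")
    return problems


# Demo with in-memory lines; with real files, pass open file handles instead.
fasta = [">seq1\n", "ACGT\n", ">seq2\n", "ACGT\n"]
taxonomy = ["seq1\tBacteria;Bacteroidetes 1;Bacteroidia;\n", "seq3\tBacteria;\n"]
for problem in check_reference(fasta, taxonomy):
    print(problem)
```

The script only reports problems and leaves both files untouched, so anything it flags can be fixed by hand and the check rerun.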
Thank you so much! I will give it a try. I thought it would be a bit more complicated :).
If you wish to use the files we created, we are happy to share them. Of course, SILVA just released version 115, so our version 111 is now slightly outdated, but still gives better results than the old reference files on the mothur website.
Send me your e-mail through the mothur forum and I will send you a link to my dropbox where you can download the files.
I just downloaded the newest version of mothur (V.1.32) and reran the analysis that caused me problems with version V.1.31 (see posts above). While it now works on my computer without crashing (even though it eats up all of my RAM during its run), it is still much slower than the earlier version of mothur. Classify.seqs in version 1.32 took around 2 h to classify 11000 sequences with the SILVA NR V111 eukaryotic taxonomy file that we created, while it took only around 5 min with version V.1.27 (same computer, only 1 processor, same fasta file, same taxonomy files; only the versions of mothur were different).
Do you see any chance of further improvements in mothur regarding speed of the classify commands, bringing it back to where it was in V.1.27?
Did you run it more than once? The first run of classify.seqs with a new version recreates all the shortcut files to account for any changes in the source code. I suspect your 1.27.0 version was using the prebuilt shortcut files.
Thank you for the information. As you guessed correctly, I only tried it once before complaining here on the forum. I have now repeated the same analysis three more times with mothur version 1.32, so it should have had all the shortcut files from previous runs. Processing time went down, from the 2 hours of the first run to around 55 - 60 minutes in the following runs, but it is still far from the 5 minutes the same analysis took in version 1.27.
I guess I will run the classify commands in the future with the old version and do the other analyses on the most recent one.