Use existing search database for classify.seqs with knn

Dear mothur team,

I am using mothur with classify.seqs(method=knn) in a shell script that runs mothur in batch mode. The script is started many times in parallel with different data files but the same database. Unlike the wang method, the k-nearest neighbor method creates a new search database each time it is used, even if the correct 8mer and tree.sum files already exist. Unfortunately, this creates a huge overhead and slows the classification down.

Is it possible to add a check along the lines of “if the files are already there, you only need to read them”?

Best regards,

The command shouldn’t be remaking the kmer file each time it runs, but it will print “Generating search database…” each time. The wang method has a lot more shortcut files to read. When you see the “Reading …” messages with the wang method, they are referring to these extra files.

One thing to consider when running mothur’s commands with shortcut files in batch mode in parallel…

If the commands are using the same references, they will all be competing to generate the same shortcut files. This could cause unintended results. With the classify.seqs command, several shortcut files are created for each set of reference files. For example, each instance of mothur will generate a referenceFileName.8mer file. The instances will compete to open, read, write and close that file. You can avoid file corruption by first running a single instance of mothur, with as many processors as you want, to generate the shortcut files. After that you can run in batch and in parallel, because mothur will only be reading the files. Note: you will have to do this with every new release, because mothur forces a rebuild of the shortcut files with a new release.
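The warm-up approach described above could be scripted roughly as follows. This is only a sketch under some assumptions: it calls mothur in command line mode (a quoted command starting with #) rather than with a batch file, and the function names classify_cmd and classify_all, the MOTHUR and PROCS variables, and all file names are placeholders made up for the illustration.

```shell
#!/usr/bin/env bash
# Sketch: build the shortcut files once, then classify in parallel.
# MOTHUR and PROCS are hypothetical variables; override them as needed.
MOTHUR="${MOTHUR:-mothur}"
PROCS="${PROCS:-8}"

# Build the mothur command string for one input file.
classify_cmd() {
    local fasta="$1" ref="$2" tax="$3" procs="$4"
    printf '#classify.seqs(fasta=%s, reference=%s, taxonomy=%s, method=knn, numwanted=10, processors=%s)' \
        "$fasta" "$ref" "$tax" "$procs"
}

# Usage: classify_all REFERENCE TAXONOMY FIRST_FASTA [MORE_FASTA...]
classify_all() {
    local ref="$1" tax="$2" first="$3"
    shift 3

    # Step 1: a single instance runs alone, so nothing else touches the
    # shortcut files while they are being written.
    "$MOTHUR" "$(classify_cmd "$first" "$ref" "$tax" "$PROCS")" || return 1

    # Step 2: the remaining inputs run in parallel, one instance each.
    local f
    for f in "$@"; do
        "$MOTHUR" "$(classify_cmd "$f" "$ref" "$tax" 1)" &
    done
    wait
}
```

A call would look like classify_all reference.align reference.tax sample1.fasta sample2.fasta sample3.fasta. Only the first sample is processed serially; if the shortcut files already exist and are current, that first run costs no more than any other.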

Thanks for the fast reply!

I’ve run the command three times in a row.

classify.seqs(fasta=SRR952153_1.trim.contigs.fasta, reference=~/Databases/Mothur/Silva/R119/silva.nr_v119.align, taxonomy=~/Databases/Mothur/Silva/R119/, method=knn, numwanted=10, processors=8)

Normally the second and third runs should be much faster than the first, because the kmer and shortcut files already exist, but all three runs took about 7 minutes. The timestamps of the kmer and shortcut files were also updated each time. The mothur version I’m using is 1.35.1.
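For anyone who wants to repeat the timestamp check, a small helper along these lines lists the shortcut files that were rewritten during a run. It is a sketch only: the function name rebuilt_since is made up, and the file patterns are assumptions based on the 8mer and tree.sum names mentioned in this thread.

```shell
# List shortcut files in DIR that were modified after MARKER was created.
# Usage: rebuilt_since DIR MARKER
rebuilt_since() {
    local dir="$1" marker="$2"
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.8mer*' -o -name '*.tree.sum' \) \
        -newer "$marker"
}
```

Create an empty marker file (for example with mktemp) just before starting mothur, then call rebuilt_since with the reference directory and the marker once the run finishes. Empty output means the shortcut files were only read; any file name printed was rewritten during the run.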

Thanks for the advice! That’s actually what I’m doing, but I didn’t know that mothur rebuilds the files with each release.