Use existing search database for classify.seqs with knn

pblumenk · November 11, 2015, 12:47pm

Dear mothur team,

I am using mothur with classify.seqs(method=knn) in a shell script that runs mothur in batch mode. The script is started many times in parallel with different data files but the same database. Unlike the wang method, the k-nearest neighbor method creates a new search database each time it is used, even if the correct 8mer and tree.sum files already exist. Unfortunately, this creates a huge overhead and slows the classification down.

Would it be possible to add a check along the lines of “if the files are already there, only read them”?

Best regards,
Patrick

westcott · November 12, 2015, 2:25pm

The command shouldn’t be remaking the kmer file on every run, although it will print “Generating search database…” each time. The wang method has a lot more shortcut files to read. When you see the “Reading …” messages with the wang method, they refer to these extra files.

One thing to consider when running mothur’s commands with shortcut files in batch mode in parallel…

If the commands are using the same references they will all be competing to generate the same shortcut files. This could cause unintended results. With the classify.seqs command several shortcut files are created for each set of reference files. For example, each instance of mothur will generate a referenceFileName.8mer file. The instances will compete to read, write, open and close the file. You can avoid file corruptions by running one instance of mothur with as many processors as you want to generate the shortcut files, and then you can run in batch and in parallel after that because mothur will only be reading the files. Note: you will have to do this with every new release because mothur forces a rebuild of shortcut files with a new release.
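A rough sketch of that two-step workflow, assuming mothur is started in its command-line mode (`mothur "#command"`). The helper names (`classify_cmd`, `warm_up`, `classify_parallel`), the `MOTHUR` variable and the file names in the usage note are placeholders for illustration and not part of mothur. The classify.seqs parameters are the ones used in this thread.

```shell
#!/bin/sh
# Sketch only: build the shortcut files once, then classify in parallel.
# MOTHUR is an assumed variable naming the mothur executable.
MOTHUR="${MOTHUR:-mothur}"

# Print the classify.seqs command string for one FASTA file.
# $1 = fasta, $2 = reference, $3 = taxonomy, $4 = processors
classify_cmd() {
    printf '#classify.seqs(fasta=%s, reference=%s, taxonomy=%s, method=knn, numwanted=10, processors=%s)' \
        "$1" "$2" "$3" "$4"
}

# Step 1: run a single instance on its own, so that it is the only
# process writing the shortcut files for this reference.
warm_up() {
    "$MOTHUR" "$(classify_cmd "$1" "$2" "$3" "$4")"
}

# Step 2: start one background instance per remaining FASTA file,
# then wait until all of them have finished.
# $1 = reference, $2 = taxonomy, remaining arguments = FASTA files
classify_parallel() {
    ref=$1
    tax=$2
    shift 2
    for fasta in "$@"; do
        "$MOTHUR" "$(classify_cmd "$fasta" "$ref" "$tax" 1)" &
    done
    wait
}
```

A call could then look like `warm_up first.fasta ref.align ref.tax 8` followed by `classify_parallel ref.align ref.tax second.fasta third.fasta`. The point of the split is only that step 1 finishes before any parallel instance starts, so the instances in step 2 never compete to write the same shortcut file.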

pblumenk · November 18, 2015, 11:30am

Thanks for the fast reply!

I’ve run the command three times in a row.

classify.seqs(fasta=SRR952153_1.trim.contigs.fasta, reference=~/Databases/Mothur/Silva/R119/silva.nr_v119.align, taxonomy=~/Databases/Mothur/Silva/R119/silva.nr_v119.tax, method=knn, numwanted=10, processors=8)

Normally the second and third runs should be much faster than the first, because the kmer and shortcut files already exist, but all three runs took about 7 minutes. The timestamps of the kmer and shortcut files were also updated. The mothur version I’m using is 1.35.1.

westcott:

One thing to consider when running mothur’s commands with shortcut files in batch mode in parallel…

If the commands are using the same references they will all be competing to generate the same shortcut files. This could cause unintended results. With the classify.seqs command several shortcut files are created for each set of reference files. For example, each instance of mothur will generate a referenceFileName.8mer file. The instances will compete to read, write, open and close the file. You can avoid file corruptions by running one instance of mothur with as many processors as you want to generate the shortcut files, and then you can run in batch and in parallel after that because mothur will only be reading the files. Note: you will have to do this with every new release because mothur forces a rebuild of shortcut files with a new release.

Thanks for the advice! That’s actually what I’m doing, but I didn’t know that mothur rebuilds the files with each release.
