Hello! My first question and hopefully not my first dumb question.

In terms of theoretical processing time (and/or memory requirements?), what is the difference between using a reference library with 100,000 sequences versus using a reference library with 400,000 sequences?

Would it possibly take four times as long (e.g. a linear increase, 4<>1 = 4 times)?

Or, would it take sixteen times as long (e.g. a non-linear increase, 4<>1 = 4^2 = 16 times)?

Or, would it not really matter either way after the first run “hash tables” were made?

For example, if I reduce the MIDORI COI reference library (over 1 million sequences, but this includes many terrestrial/land stuff) to only contain “marine/oceanic species” (e.g. plankton, fish, algae), this would reduce the MIDORI library by more than half. I am wondering if cutting the library in half like this would reduce my matching/identification time two, four, or no-noticeable change.

I admit that I don’t fully understand the wizardry behind the matching math, but I would imagine it is probably non-linear?

(Thank you for any insight into this!)

Hello Dr. Schloss! Thank you for your reply.

I was repeatedly confused in my attempts to track down an answer because searching with the keyword “linear or non-linear” hits more on the mathematical identity (e.g. a “linear” regression) rather than the relationship with “n” in terms of actual calculation effort.

Even thinking of “n”-based performance of simple sorting or searching algorithms, I think they teach entire CompSci classes on the theoretical vs. real life performance of searching and sorting algorithms, and depending on the method and sorted/unsorted nature of the incoming data, these can be linear, non-linear, log, exponential, quadratic, or what-not, right?

Do you expect that the Mothur run-time *memory* requirements are also linear? If I double or half the reference database size, will that simply double or half the memory needs, or could memory needs be non-linear (e.g. 4-times more or less memory is needed) ?

I ask because I am a member of the MetaZooGene ocean barcoding working group (MetaZooGene.org), where we are attempting to assemble a marine organisms only reference database subset of COI/16S/18S/28S/12S data. We expect this will be 1/2 to 1/3 the size of other traditional reference databases (which also include huge amounts of terrestrial and pathogen sequences). Many of us are using Mothur. I am new to the field, but I truly appreciate the ability to quickly install and instantly run your software on either a laptop or a 16 processor AWS cloud instance. (One runs slightly faster than the other, of course …)

Thank you for your answers, and for this excellent software.

That should be linear too…

Pat