Understanding performance of chimera.uchime

Chimera.uchime doesn't seem to fully take advantage of multiple processes on my system. For some reason it starts the correct number of processes, then gradually uses fewer and fewer of them, eventually decaying to the point where it is processing samples sequentially.

Here's what that looks like, plotting CPU usage over time:

[CPU usage plot]

I'm processing ~75 samples with processors=32 on the fastest EC2 instance Amazon offers (cc2.8xlarge, dual 8-core Xeons with Hyperthreading). I logged CPU usage with sar, and nothing else was running on the system. Understanding mothur's code that handles this is a bit beyond me, but it looks like there might be an easy performance win for large analyses by reworking how uchime processes are started. The times plotted are real: it takes over 4 hours to chimera-check the project I'm working on (soil is pretty crazy), but it looks to me like it doesn't need to.
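To show the kind of rework I mean, here's a minimal sketch (not mothur's actual code, which is C++) of dynamic scheduling: instead of dealing samples out to processes up front, a pool of workers pulls the next sample as soon as one frees up, so no core idles while work remains. The `run_uchime` wrapper, the sample filenames, and the uchime command line are all hypothetical placeholders.

```python
from multiprocessing import Pool
import subprocess

SAMPLES = [f"sample_{i}.fasta" for i in range(75)]  # hypothetical inputs

def run_uchime(sample):
    """Hypothetical wrapper: chimera-check one sample with uchime.
    The command line below is illustrative, not mothur's invocation."""
    subprocess.run(["uchime", "--input", sample], check=True)
    return sample

if __name__ == "__main__":
    # 32 workers; each grabs the next sample the moment it finishes
    # its current one, instead of being handed a fixed share up front.
    with Pool(processes=32) as pool:
        for done in pool.imap_unordered(run_uchime, SAMPLES):
            print(f"finished {done}")
```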

I'd be happy to provide more information if that would be useful.

First off, there are many ways to parallelize code, and what we perceive to be the fastest isn't always possible because of how the algorithms are designed. If you had one sample and 100 processors, running uchime in de novo mode would take just as long as with 1 processor, because a single sample can't be parallelized on its own. The parallelization we developed instead puts each sample on a different processor.

So take your case of 75 samples and 32 cores: 21 cores would each get 2 samples and 11 cores would each get 3. Each core then processes its 2 or 3 samples, and when a core finishes its share, its process ends. If all of your samples were the same size, the 21 cores would end together and, 50% later, the other 11 cores would end. But because real samples aren't identical, this doesn't happen. What you see is that the cores holding the smaller samples end first, giving a stair step where each step corresponds to roughly 1/32, or about 3%, of total CPU.
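To make the arithmetic concrete, here is a small sketch (mine, not mothur's code) of the static split described above and why uneven sample sizes turn it into a stair step:

```python
samples, cores = 75, 32

base, extra = divmod(samples, cores)   # base = 2, extra = 11
# 'extra' cores get base+1 samples; the rest get 'base' samples
assignments = [base + 1] * extra + [base] * (cores - extra)
print(assignments.count(3), "cores with 3 samples")  # 11
print(assignments.count(2), "cores with 2 samples")  # 21

# Each core's wall time is the sum of its samples' run times, so a
# core that drew small samples finishes early and then sits idle.
# With identical samples, utilization would drop from 32 busy cores
# to 11 in a single step; with real (unequal) samples each core
# finishes at a different time, producing the ~1/32 (~3%) stair-step
# decay in CPU usage seen in the plot.
```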

Make sense?
Pat