Chimera.uchime doesn’t seem to take full advantage of multiple processes on my system. It appears to start the correct number of processes, then gradually use fewer and fewer of them, eventually decaying to the point where it is processing samples sequentially.
Here’s what that looks like, plotting the CPU usage over time:
I’m processing ~75 samples with processors=32 on the fastest EC2 instance type at Amazon (cc2.8xlarge, dual 8-core Xeons with Hyperthreading). I logged CPU usage with sar, and nothing else was running on the system. Unfortunately, understanding the mothur code that handles this is a bit beyond me, but it appears there might be an easy performance win for large analyses by reworking how uchime processes are started. The times plotted are real: it takes over 4 hours to chimera check the project I’m working on (soil is pretty crazy), but it looks to me like it doesn’t need to.
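To illustrate the kind of thing I suspect, here is a toy Python sketch. It is only a guess at the cause: I have not confirmed that mothur splits samples among processes up front, and the function names and sample durations below are made up for illustration. It compares the total wall time when samples are divided into fixed chunks before starting with the wall time when each process pulls the next sample as soon as it is free.

```python
import heapq

def static_makespan(durations, workers):
    """Wall time if samples are split into fixed contiguous chunks up front.

    Assumption (not verified against mothur): one chunk per process, and a
    process that finishes its chunk early simply sits idle.
    """
    n = len(durations)
    chunk = -(-n // workers)  # ceiling division: samples per process
    loads = [sum(durations[i:i + chunk]) for i in range(0, n, chunk)]
    return max(loads)

def dynamic_makespan(durations, workers):
    """Wall time if each process takes the next sample as soon as it is free."""
    free_at = [0.0] * workers  # time at which each process becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available process
        heapq.heappush(free_at, start + d)  # busy until this sample is done
    return max(free_at)

if __name__ == "__main__":
    # Hypothetical per-sample run times (arbitrary units): two slow, two fast.
    durations = [5, 5, 1, 1]
    print("static :", static_makespan(durations, workers=2))
    print("dynamic:", dynamic_makespan(durations, workers=2))
```

In this toy case the fixed split puts both slow samples on the same process, so the other process goes idle early, which is the same shape as the decaying CPU usage I am seeing. If something like this is what is going on, handing out samples one at a time might be the easy win.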
I’d be happy to provide more information if it would be useful.