Suggestions for large files for uchime de novo

Hi Pat,
I am using mothur to chimera-screen MiSeq data. My plan is to run both uchime reference and uchime de novo and remove any sequence called chimeric by either method. I run into a problem when a sample has more than 100,000 unique sequences. Although we aim for ~5,000 to 10,000 sequences per sample, a handful of samples often end up with far more reads than the rest, and once a sample exceeds 100,000 unique sequences, the uchime de novo run time stretches to several days or even weeks, depending on how large that sample gets. I have tried both 8 and 20 processors, with little decrease in run time.
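For concreteness, the two passes look roughly like this (file names are placeholders, silva.gold.align is just one common reference choice, and the accnos names are illustrative; I use whichever accnos files the run actually produces):

# reference-based pass against a curated reference alignment
chimera.uchime(fasta=sample.unique.fasta, count=sample.count_table, reference=silva.gold.align, processors=8)
# de novo pass, which uses the abundances in the count table
chimera.uchime(fasta=sample.unique.fasta, count=sample.count_table, processors=8)
# remove everything flagged by either pass
remove.seqs(fasta=sample.unique.fasta, count=sample.count_table, accnos=sample.unique.ref.uchime.accnos)
remove.seqs(fasta=current, count=current, accnos=sample.unique.denovo.uchime.accnos)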

Can you give me any advice on how to increase the speed, or do you recommend using the uchime reference step alone instead?

Thank you,
Kathie Mihindukulasuriya

Are you using pre.cluster? If so, I suspect the problem is more fundamental and comes down to the quality of your data. Even if you were to get through chimera.uchime and emerge with samples that still had 100k unique reads, it would take a very long time to get through the clustering step. You might check this out…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/
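If you aren't running pre.cluster yet, it goes right before the chimera check and collapses near-identical reads, which is usually what brings the unique counts down to something manageable. A rough sketch (file names are placeholders; diffs=2 assumes ~250 bp reads, following the rule of thumb of about 1 difference per 100 bp):

# merge rare reads into more abundant reads that are within 2 mismatches
pre.cluster(fasta=sample.unique.fasta, count=sample.count_table, diffs=2)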

Hi,
Thanks, Pat. I am doing the pre.cluster step and am only working on unique reads. The data I have so far have not been of optimal quality (we are still working out production bugs).

I worry that even with better data in the future, I may still run into large files if the MiSeq behaves like the 454, where a couple of samples sometimes get far more reads than the rest. If those samples happen to have a long tail of real taxa, I could hit the same problem.

Thank you very much for your quick and helpful replies,
Kathie