Have You Faced Issues Managing Large Data Sets in Mothur?

I recently started using Mothur as a tool for processing and analyzing microbial data. As someone who’s relatively new to this kind of work, I’ve been diving into it with enthusiasm, but I’ve hit a specific roadblock I need advice on.

To give you a bit of context: I’m working on a dataset that includes over a million sequences. Everything seemed straightforward during the initial preprocessing steps. However, when it came to clustering and calculating diversity indices, things started to get messy. The commands I’m using feel correct, but I’ve noticed that my computer slows down significantly during these processes. It sometimes takes hours or even days to complete certain steps. I’ve tried splitting the dataset into smaller subsets, but this approach feels cumbersome and introduces other challenges, such as ensuring consistency when merging the results back together.

I’ve been thinking about how to improve the way I handle this workflow and keep track of progress. The idea came from running trackers, specifically real-time online running performance monitors, which let athletes follow their distance, time, and performance metrics as they go and adjust their routines based on immediate feedback. That made me wonder whether something similar exists for monitoring Mothur workflows: a system that logs commands, captures outputs, and gives a real-time overview of processing would make troubleshooting and progress monitoring much easier.
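To make the idea concrete, here is a minimal Python sketch of the kind of tracker I have in mind. It is not a Mothur feature: run_and_log is a hypothetical helper name, and the log format is just an assumption. It runs one command, prints each output line with a timestamp as it arrives, and appends the same lines to a log file together with the exit code and elapsed time.

```python
import subprocess
import sys
import time
from datetime import datetime


def _stamp():
    """Current local time as an ISO string, to the second."""
    return datetime.now().isoformat(timespec="seconds")


def run_and_log(command, log_path):
    """Run one command, echo its output live, and append it to a log file.

    `command` is a list of arguments (hypothetical example: ["mothur", "my.batch"]).
    Returns (exit_code, elapsed_seconds).
    """
    start = time.monotonic()
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"[{_stamp()}] START {' '.join(command)}\n")
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge error output into the same stream
            text=True,
            bufsize=1,  # line-buffered so lines show up as they are produced
        )
        for line in proc.stdout:
            entry = f"[{_stamp()}] {line.rstrip()}"
            print(entry, flush=True)  # live feedback in the terminal
            log.write(entry + "\n")
            log.flush()  # keep the log current if the run is interrupted
        exit_code = proc.wait()
        elapsed = time.monotonic() - start
        log.write(f"[{_stamp()}] END exit={exit_code} elapsed={elapsed:.1f}s\n")
    return exit_code, elapsed
```

Assuming the mothur executable is on the PATH and my.batch is a batch file of commands, I would call it as run_and_log(["mothur", "my.batch"], "workflow.log") and watch the log with a second terminal.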

My question is, does Mothur have any features or tools that provide such a capability? If not, how do others keep track of their workflows and outputs efficiently?

Additionally, I’ve run into performance bottlenecks due to my hardware, which is a mid-range laptop. Has anyone else experienced similar slowdowns when working with large datasets? If so, how have you managed this? Are there specific configurations or alternative methods that you’d recommend?

Lastly, I’ve been using the default parameters for commands like cluster. Are there any adjustments to these settings that might help with processing speed while maintaining accuracy? I’d love to hear your insights and how you approach similar challenges.

Hi there - I’d encourage you to make sure you are following the MiSeq SOP. That is how my lab uses mothur to analyze our data. You’ll see an alternative to cluster, cluster.split, which helps with run times.
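As a rough sketch of what that command looks like, the Python helper below just assembles a cluster.split call as a string. The helper name and the file names are placeholders, and the taxlevel=4 and cutoff=0.03 defaults are assumptions that should be checked against the current SOP before use.

```python
def cluster_split_command(fasta, count, taxonomy, taxlevel=4, cutoff=0.03,
                          processors=None):
    """Build a mothur cluster.split command string.

    File names and default values here are illustrative assumptions,
    not a substitute for the settings given in the SOP.
    """
    params = {
        "fasta": fasta,
        "count": count,
        "taxonomy": taxonomy,
        "taxlevel": taxlevel,
        "cutoff": cutoff,
    }
    if processors is not None:
        params["processors"] = processors  # only include when requested
    body = ", ".join(f"{key}={value}" for key, value in params.items())
    return f"cluster.split({body})"


# Placeholder file names; substitute the outputs of your own preprocessing.
print(cluster_split_command("final.fasta", "final.count_table", "final.taxonomy"))
```

The printed line can be pasted at the mothur prompt or placed in a batch file.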

Also, you haven’t specified what you are trying to analyze, but I suspect it isn’t V4. People often run into problems very similar to the ones you describe when they sequence V3-V4 or V4-V5 or use the 300PE chemistry. None of that works. As we showed in the original Kozich paper (and it is still true today), the reads need to fully overlap to get complete denoising of your data. I’d encourage you to read this blogpost from 10 years back that many people on this forum have found still rings true to their experience.

Finally, I’d encourage you to use a high-performance computer at your institution or something like AWS for your analysis. This will be far cheaper and more flexible than what you can do with your laptop.

Pat