Mothur for large amounts of data

As sequencing throughput grows faster and faster, it is becoming more difficult for Mothur to handle large amounts of data.

For example, I have a sample with 70,000 unique reads. Processing it takes an intolerable amount of time and CPU.

And with several samples it becomes unmanageable. As a result, functions such as rarefaction.shared, tree.shared, and collect.shared cannot be used.

Is there any approach to handling large amounts of data?

I feel your pain. We are working on a few improvements to the clustering algorithms to make them parallelized. However, I don’t believe that the number of uniques will scale with additional reads when one corrects for sequencing errors. I just processed about 10 million sequences down to 22,000 uniques, which was pretty simple to cluster into OTUs, etc. I’ll post more details soon, but the upshot is that the distal end of the sequences seems to accumulate errors. If you use the qthreshold option with a score of 20 and use pre-clustering, the sequencing error rate drops by more than an order of magnitude. As for the examples you cite, I’m surprised to hear that these are the rate-limiting step.
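For reference, the quality-trimming and pre-clustering steps described above look roughly like this (file names here are placeholders, and you should check the wiki pages for trim.seqs and pre.cluster for the parameters that fit your data):

```
mothur > trim.seqs(fasta=seqs.fasta, qfile=seqs.qual, qthreshold=20)
mothur > unique.seqs(fasta=seqs.trim.fasta)
mothur > pre.cluster(fasta=seqs.trim.unique.fasta, name=seqs.trim.names)
```

The qthreshold=20 trim discards the low-quality distal ends, and pre.cluster then merges sequences that differ only by likely sequencing errors, which is what collapses the number of uniques.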

It should be obvious, but we are definitely in the realm where we can no longer do many of these analyses on a laptop or even a garden-variety desktop. If a lab is willing to spend $10k’s on sequencing, they should be willing to spend $10k on a computer.

I am happy to work with people to make sure that they have optimized their pipeline.


Thank you very much.

We pyrosequenced 4 different samples monitored over several months. We obtained more than 300,000 sequences in 172 fasta files. We don't want to merge the replicates together because we consider them different samples, so processing these sequences is going to take quite a lot of time. One of the questions I was asking myself was: is there any way to run the same command (for example chimera.uchime) on more than one of those files at the same time?
Thank you!!

Have you run the practice MiSeq SOP? You can run the whole thing and keep track of which sequence comes from each group and do your analysis with that. Perhaps I’m missing what you’re trying to do.

Thank you! I'll check it!

I know that I need to spend more time with the SOP you provided, but my question was whether it is possible to run:
mothur > chimera.uchime(fasta=A0.1.fasta) on A0.1.fasta, A0.2.fasta, A0.3.fasta, … at the same time, so that I can leave it running overnight and then collect all the output files. In other words, a way of running the same command on multiple files without typing a separate command for every file in the set (172 in our case).
Tell me that it is possible, please. I hope there is a simple way to do it.
Thank you!

Not exactly. If you have a group file or a count file where all of those groups are together, along with a fasta and name file, then you can run…

mothur > chimera.slayer(fasta=, name=, group=)

or

mothur > chimera.slayer(fasta=, count=)

That will check for chimeras in each sample. There are some tweaks you’d have to make based on whether you want a chimera flagged in one sample to be flagged in all samples.
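If memory serves (worth double-checking on the mothur wiki), the tweak in question is the dereplicate parameter:

```
mothur > chimera.slayer(fasta=, count=, dereplicate=t)
```

With dereplicate=f (the default), a sequence flagged as chimeric in one sample is removed from all samples; with dereplicate=t it is only removed from the sample where it was flagged.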

For the way you’re doing it, the best thing would probably be to just set up a perl/python script to run mothur or to create a batch file that you could run through mothur. I’m still not clear why you’d want to do 172 files separately. Theoretically, you’d like to compare those to each other and if so, the data at some point need to be merged - it would be easiest to merge them way back at the beginning so you don’t have to worry about this stuff.
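As a minimal sketch of the batch-file approach: the loop below writes one chimera.uchime line per sample fasta into a mothur batch file, which you can then run in a single mothur call overnight. The A0.*.fasta names follow the pattern mentioned above, and reference=self is just one possible uchime setup — substitute your own reference and parameters.

```shell
# Build a mothur batch file that queues chimera.uchime for each sample fasta.
: > chimera.batch                                  # start from an empty batch file
for f in A0.1.fasta A0.2.fasta A0.3.fasta; do      # in practice: for f in A0.*.fasta
    echo "chimera.uchime(fasta=$f, reference=self)" >> chimera.batch
done
# mothur chimera.batch                             # uncomment to run the whole batch
```

For all 172 files you would replace the explicit list with a glob (for f in *.fasta), so the batch file is generated without typing each name.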