Uchime processes


This is more a question of interest than anything else. Because I have to use long (~450 bp) sequences, my uchime commands tend to take a long time to run (up to a month). Because of a problem with the MiSeq, the sequencing centre I use put my samples on a HiSeq, so I now have twice my normal number of sequences. Uchime is therefore taking a VERY long time to run (>1 month).

Here is the question. I’ve noticed that the process is already slow at first, while I am still getting output to the screen like this:

'00:00 18Mb 0.1% Reading 10118_projecttrim.contigs.good.unique.good.filter.unique.precluster.temp22118.temp
WARNING: Ignoring gaps in FASTA file ‘10118_projecttrim.contigs.good.unique.good.filter.unique.precluster.temp22118.temp’
00:00 24Mb 100.0% Reading 10118_projecttrim.contigs.good.unique.good.filter.unique.precluster.temp22118.temp
00:00 24Mb 12.5k sequences
37:18 12Mb 100.0% 6188/12472 chimeras found (49.6%)

It took 2239 secs to check 12473 sequences from group TroutRTGEpooDOM.’

The screen output goes through all of the samples in this way and after about a week it stops delivering output to the screen.

However, the longest part of the process seems to come after this point. At this stage I get no new output to the screen, but the process doesn’t end. It just sits there. This part takes weeks, and the sizes of my uchime output files don’t seem to change from day to day over that time, which makes me feel like it isn’t really doing very much. It must be doing something, though, because for the process VIRT is 15.6g, RES is 12g and SHR is 80.

I know from the past that it will probably successfully finish and I don’t have a memory problem. I was just wondering exactly what the command is doing at this point if it has gone through the chimera detection stage with the samples and why it would take so long in comparison?

Thanks for your help,


Good question. After the chimera detection process completes, mothur looks at the dereplicate parameter. By default this is false, meaning that if a sequence is found to be chimeric in one sample, it is treated as chimeric in all samples. This involves parsing the results of the uchime program. I will add a feature request to our list to look at ways to improve the speed of this process. Thanks for bringing this to our attention, Sarah.
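To make the dereplicate=f behaviour described above concrete, here is a minimal Python sketch. It is not mothur's code; the sample names, sequence names and the function name are all made up for illustration. It only shows the rule that one hit in any sample flags the sequence everywhere.

```python
# Hypothetical per-sample results: sample name -> sequence names flagged there.
flagged_by_sample = {
    "sampleA": {"seq1", "seq3"},
    "sampleB": {"seq3"},
    "sampleC": {"seq5"},
}


def chimeras_without_dereplicate(flagged_by_sample):
    """Union of the per-sample flags: a hit in one sample counts for all samples."""
    flagged = set()
    for names in flagged_by_sample.values():
        flagged |= names
    return flagged


chimeras = chimeras_without_dereplicate(flagged_by_sample)
print(sorted(chimeras))
```

Combining the per-sample results this way requires reading every sample's result first, which is consistent with the parsing step mentioned above.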

I ran the command with dereplicate=t, as I am following the MiSeq SOP. Doesn’t this mean it wouldn’t perform the parsing step?



Yes, the parsing step would be skipped. If you are running the command with a count file, mothur will create a new count file in which the counts are zeroed out for the samples where a sequence was found to be chimeric. Any sequences found to be chimeric in all samples are removed completely. Perhaps we can speed up this process as well.
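The count file rewrite described above can be sketched roughly as follows. This is an illustration in Python, not mothur's implementation. The table layout, the names and the function are hypothetical, and "chimeric in all samples" is assumed here to mean that no non-zero count is left once the flagged samples are zeroed.

```python
def zero_chimeric_counts(counts, flagged_by_sample):
    """Return a new count table with flagged (sequence, sample) cells set to 0.

    counts: sequence name -> {sample name: abundance}
    flagged_by_sample: sample name -> set of sequence names flagged in that sample
    """
    cleaned = {}
    for name, per_sample in counts.items():
        row = {
            sample: 0 if name in flagged_by_sample.get(sample, set()) else n
            for sample, n in per_sample.items()
        }
        # Assumption: a sequence with nothing left in any sample is dropped.
        if sum(row.values()) > 0:
            cleaned[name] = row
    return cleaned


# Made-up example data.
counts = {
    "seq1": {"sampleA": 10, "sampleB": 4},
    "seq2": {"sampleA": 0, "sampleB": 7},
    "seq3": {"sampleA": 3, "sampleB": 2},
}
flagged_by_sample = {"sampleA": {"seq1", "seq3"}, "sampleB": {"seq3"}}

cleaned = zero_chimeric_counts(counts, flagged_by_sample)
print(cleaned)
```

In this toy table seq1 survives with its sampleA count set to zero, while seq3 is flagged in both samples and disappears.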

So, my command was unfortunately accidentally killed by a server problem before it was able to complete this last step and it therefore did not create a count file. However, it had created the accnos file by this point so I ran remove.seqs using the fasta and count files created by the pre.cluster command. This appears to have successfully created a fasta and count file with the chimeras in the accnos file removed.

I just wanted to check whether people feel this is an OK thing to do. It just seems odd: if it was the creation of the count file that was taking so long in the uchime command, why would it take so little time to create one using the remove.seqs command? They are pretty much doing the same thing, aren’t they?
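For comparison, removing sequences by name from a count table can be sketched as below. This is only a Python illustration with made-up names and data, not what remove.seqs does internally. It drops whole rows listed in the accnos file and leaves every other count untouched, so it is not necessarily the same as zeroing counts sample by sample as described in the reply above.

```python
def remove_listed(counts, accnos_names):
    """Drop every sequence whose name appears in the accnos list; keep the rest as-is."""
    return {name: row for name, row in counts.items() if name not in accnos_names}


# Made-up example data: sequence name -> {sample name: abundance}.
counts = {
    "seq1": {"sampleA": 10, "sampleB": 4},
    "seq2": {"sampleA": 0, "sampleB": 7},
    "seq3": {"sampleA": 3, "sampleB": 2},
}
accnos_names = {"seq3"}  # hypothetical contents of the accnos file

filtered = remove_listed(counts, accnos_names)
print(filtered)
```

Here seq3 is removed outright and seq1 keeps its full count in sampleA, whatever uchime reported for that sample.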


I also have sequences a little over 400 nt, and I’ve been waiting on the uchime command for 12 h… :shock: …is that normal? I have 7 samples (MiSeq, 2x300) and I’m following the SOP.

This is relevant reading for any analysis using 2x300 PE reads on MiSeq.