faster OTU clustering

Fabian · October 4, 2011, 7:38pm

Hi,

I love MOTHUR (I have thought about getting a tattoo saying exactly that). The only thing that is killing me at the moment is the clustering into OTUs. It is just not feasible to cluster >50,000 reads with the cluster command. Uclust is so much faster, but then I am stuck with its output. Looks like a happy Perl file format conversion orgy ahead of me.

hcluster(column=4.TCA.454Reads.trim.unique.filter.dist,name=4.TCA.454Reads.trim.names,method=average)

I tried both cluster and hcluster. The job has been running for days now on ~40,000 unique reads.

Cheers,
Fabian

Fabian · October 4, 2011, 7:59pm

Sorry, will try cluster.split before complaining again.

Fabian · October 4, 2011, 8:10pm

This is a longer monologue

I have tried cluster.split:

mothur > cluster.split(column=2.TCA.454Reads.trim.unique.pick.filter.dist,name=2.TCA.454Reads.trim.pick.names,method=average)
Splitting the file…
It took 1288 seconds to split the distance file.

Reading 2.TCA.454Reads.trim.unique.pick.filter.dist.0.temp
********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||

Clustering 2.TCA.454Reads.trim.unique.pick.filter.dist.0.temp
It took 141830 seconds to cluster
Merging the clustered files…
It took 31 seconds to merge.

Output File Names:
2.TCA.454Reads.trim.unique.pick.filter.an.sabund
2.TCA.454Reads.trim.unique.pick.filter.an.rabund
2.TCA.454Reads.trim.unique.pick.filter.an.list

Still takes about 40 h (141,830 seconds is roughly 39.4 hours). Uclust takes on the order of 10 minutes.

pschloss · October 5, 2011, 12:46pm

Sorry, our documentation is pretty poor at the moment. The way you are running cluster.split there isn’t actually any splitting going on. Try one of the following options…

cluster.split(column=2.TCA.454Reads.trim.unique.pick.filter.dist, name=2.TCA.454Reads.trim.pick.names, method=average, cutoff=0.25)

cluster.split(column=2.TCA.454Reads.trim.unique.pick.filter.dist, name=2.TCA.454Reads.trim.pick.names, method=average, taxonomy=2.TCA.454Reads.trim.pick.taxonomy, taxlevel=3)

cluster.split(taxonomy=2.TCA.454Reads.trim.pick.taxonomy, fasta=2.TCA.454Reads.trim.pick.fasta, name=2.TCA.454Reads.trim.pick.names, method=average, taxlevel=3)

With each of these, you will probably want to set the processors option to take advantage of the parallelization. The final option is the fastest, and it was the method most comparable to Uclust in speed in the AEM paper we published at the beginning of the summer.
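
As a rough sketch, the final option with the processors option added might look like the following. The value 8 is only an illustration and is not from this thread; set it to however many cores your machine actually has.

```
cluster.split(taxonomy=2.TCA.454Reads.trim.pick.taxonomy, fasta=2.TCA.454Reads.trim.pick.fasta, name=2.TCA.454Reads.trim.pick.names, method=average, taxlevel=3, processors=8)
```

The same processors setting can be appended to the other two commands in the same way.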

Pat

Fabian · October 5, 2011, 3:16pm

Thanks for the quick reply, will try today!
