faster OTU clustering

Hi,

I love MOTHUR (thought about getting a tattoo saying these exact words). The only thing that is killing me at the moment is the clustering into OTUs. It is just not feasible to cluster >50,000 reads with the cluster command. Uclust is so much faster, but I am stuck with its output format. Looks like a happy Perl file format conversion orgy ahead of me.

hcluster(column=4.TCA.454Reads.trim.unique.filter.dist,name=4.TCA.454Reads.trim.names,method=average)

I tried both cluster and hcluster. The job has been running for days now on ~40,000 unique reads.

Cheers,
Fabian

Sorry, will try cluster.split before complaining again.

This is a longer monologue

I have tried cluster.split:

mothur > cluster.split(column=2.TCA.454Reads.trim.unique.pick.filter.dist,name=2.TCA.454Reads.trim.pick.names,method=average)
Splitting the file…
It took 1288 seconds to split the distance file.

Reading 2.TCA.454Reads.trim.unique.pick.filter.dist.0.temp
********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering 2.TCA.454Reads.trim.unique.pick.filter.dist.0.temp
It took 141830 seconds to cluster
Merging the clustered files…
It took 31 seconds to merge.

Output File Names:
2.TCA.454Reads.trim.unique.pick.filter.an.sabund
2.TCA.454Reads.trim.unique.pick.filter.an.rabund
2.TCA.454Reads.trim.unique.pick.filter.an.list

That still takes about 40 h. Uclust takes on the order of 10 minutes.

Sorry, our documentation is pretty poor at the moment. The way you are running cluster.split, there isn’t actually any splitting going on. Try one of the following options…

cluster.split(column=2.TCA.454Reads.trim.unique.pick.filter.dist,name=2.TCA.454Reads.trim.pick.names,method=average, cutoff=0.25)
cluster.split(column=2.TCA.454Reads.trim.unique.pick.filter.dist,name=2.TCA.454Reads.trim.pick.names,method=average, taxonomy=2.TCA.454Reads.trim.pick.taxonomy, taxlevel=3)
cluster.split(taxonomy=2.TCA.454Reads.trim.pick.taxonomy, fasta=2.TCA.454Reads.trim.pick.fasta, name=2.TCA.454Reads.trim.pick.names,method=average, taxlevel=3)

With each of these, you will probably want to set the processors option to take advantage of the parallelization. The final option is the fastest, and it was the method most comparable to Uclust in speed in the AEM paper we published at the beginning of the summer.
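
For example, here is the final command with processors added. Treat it as a sketch: processors=4 is only an assumed value, so set it to however many cores your machine actually has.

```
# processors=4 is an assumption; match it to your core count
cluster.split(taxonomy=2.TCA.454Reads.trim.pick.taxonomy, fasta=2.TCA.454Reads.trim.pick.fasta, name=2.TCA.454Reads.trim.pick.names, method=average, taxlevel=3, processors=4)
```

The same processors setting can be appended to either of the other two commands.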

Pat

Thanks for the quick reply, will try today!