cd-hit-like clustering of fragments

Dear mothur developers,

In very large datasets containing metagenomic reads from functional genes, we often observe redundancy that unique.seqs and pre.cluster can barely reduce. The read lengths differ after removal of low-quality regions, so only cd-hit-like clustering (grouping fragments with the longer sequences that contain them) can reduce the number of sequences to a level that mothur can reasonably process, align and cluster.

Therefore, we suggest extending the pre.cluster method so that it can group sequence fragments as well. It could work like cd-hit: first sort the sequences by length, then assign each shorter sequence to a longer one with which it has sufficient global identity. Such an improvement would be great, as it would replace the external tools we currently need for running cd-hit and maintaining the names files.
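
To illustrate the idea, here is a minimal Python sketch of the greedy grouping we have in mind. It is only an illustration, not how cd-hit or pre.cluster is implemented: the function names and the 0.97 threshold are made up for this example, and identity is computed with a simple ungapped sliding window instead of a real alignment. The last function shows how each group could be written as a names-file line (representative, a tab, then the comma-separated members).

```python
def best_ungapped_identity(fragment, representative):
    """Best fraction of matching positions when sliding the fragment
    along the representative without gaps (a simplifying assumption)."""
    flen, rlen = len(fragment), len(representative)
    if flen == 0 or flen > rlen:
        return 0.0
    best = 0
    for offset in range(rlen - flen + 1):
        window = representative[offset:offset + flen]
        matches = sum(1 for a, b in zip(fragment, window) if a == b)
        best = max(best, matches)
    return best / flen


def cluster_fragments(seqs, min_identity=0.97):
    """Greedy grouping: seqs maps name -> sequence.
    Returns a dict mapping representative name -> list of member names."""
    # Longest sequences first; ties are broken by name for reproducibility.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters = {}
    for name in order:
        for rep in clusters:
            if best_ungapped_identity(seqs[name], seqs[rep]) >= min_identity:
                clusters[rep].append(name)  # join the first matching group
                break
        else:
            clusters[name] = [name]  # no match: becomes a new representative
    return clusters


def names_lines(clusters):
    """Format each group as one names-file line."""
    return [f"{rep}\t{','.join(members)}" for rep, members in clusters.items()]


if __name__ == "__main__":
    reads = {
        "readA": "ACGTACGTTGCA",
        "readB": "GTACGTTG",      # contained in readA
        "readC": "TTTTGGGGCCCC",
        "readD": "GGGGCC",        # contained in readC
    }
    for line in names_lines(cluster_fragments(reads)):
        print(line)
```

In this toy input, readB and readD are exact substrings of readA and readC, so they are absorbed by those longer reads. A real implementation would of course need a proper alignment and a fast pre-filter to scale to very large datasets.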

Thanks a lot, best regards

Thomas Rattei
Department of Computational Systems Biology
University of Vienna