Hi,
I am running a mothur analysis of 600,000 16S-454 sequences. With so many sequences, it is difficult to run the cluster command — it would take a very long time. I found that this command cannot use multiple processors and it exhausts memory. Are there any solutions for this case? Thanks for any advice.
Yao
Dear Pat,
Thanks for your reply.
I ran the hcluster command, but there were some errors during the run (see below), and the output files only include the results at the "unique" cutoff.
Could you please give me some suggestions?
Thanks, Yao
mothur > dist.seqs(fasta=SCS454seq.unique.filter.fasta, cutoff=0.03, processors=16) …
…
Output File Name:
SCS454seq.unique.filter.dist
It took 215569 seconds to calculate the distances for 495640 sequences.
mothur > hcluster(column=SCS454seq.unique.filter.dist, name=SCS454seq.names, method=furthest)
[ERROR]: Could not open SCS454seq.unique.filter.sorted.dist.temp
It took 34779 seconds to sort.
[ERROR]: SCS454seq.unique.filter.sorted.dist is blank. Please correct.
changed cutoff to 10.005
Output File Names:
SCS454seq.unique.filter.fn.sabund
SCS454seq.unique.filter.fn.rabund
SCS454seq.unique.filter.fn.list
It took 1 seconds to cluster.
I would stay far away from hcluster. It also doesn’t look like you are following the SOP. There’s really little chance that you have 500k unique sequences if you are denoising your sequences, using screen.seqs (you aren’t), and filtering with trump=… I’d suggest following the SOP and seeing how it goes from there.
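For reference, the stretch of the SOP that Pat is pointing at looks roughly like the sketch below. The filenames use mothur's `current` shortcut and the parameter values are illustrative placeholders, not tuned settings for this dataset; check the SOP for the exact commands and arguments for your data:

```
# Remove poorly aligned sequences, then trim the alignment so all
# reads cover the same region (trump=. drops columns with any gap
# character); re-running unique.seqs afterwards shrinks the number
# of unique sequences considerably
mothur > screen.seqs(fasta=current, name=current, optimize=start-end, criteria=95)
mothur > filter.seqs(fasta=current, vertical=T, trump=.)
mothur > unique.seqs(fasta=current, name=current)
mothur > pre.cluster(fasta=current, name=current, diffs=2)

# For the clustering step itself, cluster.split scales much better
# than cluster/hcluster on large datasets and can use multiple
# processors, because it clusters independent sub-matrices
mothur > dist.seqs(fasta=current, cutoff=0.25, processors=16)
mothur > cluster.split(column=current, name=current, method=average, cutoff=0.10)
```

The point of the screen/filter/pre.cluster steps is that a properly denoised and trimmed 454 dataset rarely has anywhere near 500k unique sequences, so the distance matrix fed to clustering becomes far smaller than the one that broke hcluster above.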
Pat