I have tried to run the cluster.split command, but it never reaches the clustering step. I have also tried running it in two parts, but I am still facing issues (even when I run it on an HPC cluster).
These are the commands I ran:
chimera.vsearch(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.count_table)
classify.seqs(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)
remove.lineage(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, taxonomy=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
summary.seqs(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pick.fasta)
Here is the output of summary.seqs:
Using 40 processors.
             Start  End  NBases  Ambigs  Polymer  NumSeqs
Minimum:     1      632  220     0       3        1
2.5%-tile:   1      669  252     0       3        36094
25%-tile:    1      669  252     0       4        360931
Median:      1      669  253     0       4        721862
75%-tile:    1      669  253     0       5        1082792
97.5%-tile:  1      669  253     0       6        1407629
Maximum:     2      669  272     0       8        1443722
Mean:        1      668  252     0       4
(number) of Seqs: 1443722
It took 10 secs to summarize 1443722 sequences.
After this, I ran cluster.split:
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=36)
Using 36 processors.
Splitting the file…
/******************************************/
Selecting sequences for group Enterobacterales (1 of 129)
Number of unique sequences: 153062
I get files such as final.0.count.temp and final.0.count.dist, but the second step of cluster.split just keeps running endlessly. What should I do?
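For scale, here is a rough back-of-the-envelope sketch in Python of how many pairwise comparisons the first group alone might involve. It assumes that the distance step compares every pair of unique sequences within a group, which is my assumption and not something I have confirmed from the mothur source. The function name is just for illustration.

```python
def pairwise_comparisons(n_unique: int) -> int:
    """Number of unordered pairs among n_unique sequences: n * (n - 1) / 2."""
    return n_unique * (n_unique - 1) // 2


# Unique-sequence count reported for the Enterobacterales group in the log.
n_enterobacterales = 153062

pairs = pairwise_comparisons(n_enterobacterales)
print(f"{n_enterobacterales:,} unique sequences -> {pairs:,} pairwise comparisons")
```

If that assumption holds, this one group (1 of 129) would need roughly 11.7 billion comparisons, which may be part of why the step appears to run endlessly.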