How to run cluster.split on HPC

I am going to run the second step of cluster.split on my IBM HPC using the following script:

#!/bin/bash
#BSUB -R "rusage[mem=800GB]"
./mothur "#cluster.split(file=final.file, count=final.count_table, cutoff=0.03, processors=10)"

Here is the output of lshosts on my HPC. I can use hpc-cmp-01 and hpc-cmp-02 for my computing work.

HOST_NAME   type     model   cpuf   ncpus  maxmem  maxswp  server  RESOURCES
hpc-mgt-01  LINUXPP  POWER8  250.0  20     128G    31.9G   Yes     (mg)
hpc-mgt-02  LINUXPP  POWER8  250.0  20     128G    31.9G   Yes     (mg)
hpc-cmp-01  LINUXPP  POWER8  250.0  20     1T      31.9G   Yes     ()
hpc-cmp-02  LINUXPP  POWER8  250.0  20     1T      31.9G   Yes     ()
hpc-cmp-03  LINUXPP  POWER8  250.0  40     314.5G  31.9G   Yes     ()
hpc-cmp-04  LINUXPP  POWER8  250.0  40     314.5G  31.9G   Yes     ()
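
The bjobs output later in this thread reports a single allocated slot while the mothur command asks for processors=10. The following is a minimal sketch, not a tested recipe, of an LSF script that requests 10 slots on one of the two large-memory hosts and passes the granted slot count to mothur. The job name, output file name, and the fallback of 10 are assumptions for illustration. Whether the memory reservation is counted per host or per slot depends on how LSF is configured at your site, so check with your administrator before reusing the 800GB figure.

```shell
#!/bin/bash
# Sketch only: host names and mothur arguments come from the post above;
# the job name and output file name are hypothetical.
#BSUB -J cluster_split
#BSUB -n 10
#BSUB -R "span[hosts=1] rusage[mem=800GB]"
#BSUB -m "hpc-cmp-01 hpc-cmp-02"
#BSUB -o cluster_split.%J.out

# LSF sets LSB_DJOB_NUMPROC to the number of slots granted to the job.
# Fall back to 10 (assumed) when the variable is not set.
PROCS="${LSB_DJOB_NUMPROC:-10}"
MOTHUR_CMD="#cluster.split(file=final.file, count=final.count_table, cutoff=0.03, processors=${PROCS})"
echo "Running: ${MOTHUR_CMD}"

# Only call mothur when the executable is present in the working directory.
if [ -x ./mothur ]; then
    ./mothur "${MOTHUR_CMD}"
else
    echo "mothur executable not found in $(pwd); command not run" >&2
fi
```

Submit it with `bsub < script.sh` so that LSF reads the #BSUB lines.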

Hello, this is what I am doing on my server. The modules I load are the ones required to call mothur correctly in my working environment; they may differ for you. I would use “current” in the batch file instead of the complete path and file name, as it will save you errors. Hope it helps,

#!/bin/bash

#SBATCH --time=24:00:00
#SBATCH --account=def-myaccount
#SBATCH --mem=128000M
#SBATCH --mail-user=me@umontreal.ca
#SBATCH --cpus-per-task=32
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=REQUEUE
#SBATCH --mail-type=ALL
#SBATCH --output=projectname.out

cd $SCRATCH/path where my data are

module purge

module load gcc/9.3.0
module load mothur/1.47.0
module load vsearch/2.15.2

mothur myproject.batch

seff $SLURM_JOBID

sstat $SLURM_JOBID
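
To illustrate the “current” suggestion, here is a rough sketch of how myproject.batch could be written. It names the input files once with set.current and then refers to them as current. The names final.fasta and final.taxonomy, and the taxlevel value, are assumptions for illustration and not taken from this thread; processors=32 simply mirrors --cpus-per-task above.

```shell
# Sketch: create a mothur batch file that sets the inputs once and then uses "current".
# final.fasta, final.taxonomy, and taxlevel=4 are assumed values; substitute your own.
cat > myproject.batch <<'EOF'
set.current(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy)
cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=4, cutoff=0.03, processors=32)
EOF
```

The quoted 'EOF' keeps the shell from expanding anything inside the batch file, so the mothur commands are written exactly as typed.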

800GB - That’s a lot of memory! What problem are you running into? Are you using Slurm or PBS?

Pat

Hello, could you please clarify whether the vsearch executable is required when running the cluster.split step? I don’t have it on the HPC where I am running this particular step.

I am using an IBM machine with LSF (similar to PBS/SLURM) for resource management. I already have all the “final.93.opti_mcc.list”-type files. The process has been running for the last 24 hours and I don’t know when it is going to finish. (Please note that I don’t have the vsearch executable in my folder.) Here is what I get for my job:

bjobs -l 1154

Job <1154>, User , Project , Status , Queue , Command <
#!/bin/bash;#BSUB -R “rusage[mem=800GB]”;./mothur “#cluste
r.split(file=final.file, count=final.count_table, cutoff=0
.03, processors=10)”>, Share group charged
Thu Mar 31 17:14:52: Submitted from host , CWD <$HOME/simplestat/cl
ustersplit>, Requested Resources <rusage[mem=819200.00]>;
Thu Mar 31 17:14:52: Started 1 Task(s) on Host(s) , Allocated 1 Slo
t(s) on Host(s) , Execution Home </home/ibm>,
Execution CWD </home/ibm/simplestat/clustersplit>;
Fri Apr 1 11:27:38: Resource usage collected.
The CPU time used is 148675 seconds.
MEM: 757.3 Gbytes; SWAP: 0 Mbytes; NTHREAD: 5
PGID: 148682; PIDs: 148682 148683 148687 148688

MEMORY USAGE:
MAX MEM: 757.3 Gbytes; AVG MEM: 574 Gbytes

GPFSIO DATA:
READ: ~0 bytes; WRITE: ~0 bytes

SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] rusage[mem=819200.00]
Effective: select[type == local] order[r15s:pg] rusage[mem=819200.00]

I use vsearch for chimera detection.

So the question: what is in your “final” files? Can you post the summary?
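
One way to produce that summary is mothur's summary.seqs command. This is a minimal sketch that assumes the sequences are in a file called final.fasta (a hypothetical name; only final.file and final.count_table are mentioned in this thread) and that the mothur executable sits in the working directory.

```shell
# Sketch: print a sequence summary that can be pasted into the thread.
# final.fasta is an assumed file name; final.count_table comes from the original post.
SUMMARY_CMD="#summary.seqs(fasta=final.fasta, count=final.count_table)"

if [ -x ./mothur ]; then
    ./mothur "${SUMMARY_CMD}"
else
    # Show the command instead of failing when mothur is not available here.
    echo "Would run: ./mothur \"${SUMMARY_CMD}\""
fi
```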

I have already used vsearch for chimera search (I performed that step on a different machine).

I suspect your problem is poor quality data. If your reads don’t fully overlap (e.g. 2x250 to sequence the V4 region) or if you had a bad sequencing run, you are likely to see results like the ones you have.