Process killed when running in batch mode

jrhaulung · January 14, 2021, 1:12am

Dear Mothur friends:

We are trying to analyze 16S data (Illumina V3+V4, 97 samples, 20 Gb of data) with a batch file modified from the stability.batch shown in the MiSeq SOP, on an Ubuntu server with dual CPUs and 380 GB of RAM. The number of processors used in the analysis has already been reduced to 28, but the run still gets killed partway through (we are not quite sure where, probably in cluster.split). I will try reducing the number of processors or the taxlevel further to see how it goes. To save time, I would like to reuse the data generated by the commands that ran before the one that got killed, rather than running the batch file from the start again. How should I modify the batch file to do so? Any suggestions for overcoming this obstacle would be highly appreciated. The batch file and the terminal output from the killed step are shown below.

Sincerely,

Jrhau

REFERENCE_LOCATION=/media/mpiu/a93b0b36-e288-45ef-b21f-acc26e4b0af9/Bacteria-16S-Ref
ALIGNREF=silva.full_v138.fasta
TAXONREF_FASTA=trainset9_032012.pds.fasta
TAXONREF_TAX=trainset9_032012.pds.tax
CONTAMINENTS=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota
LOGNAME=20201026-trial2
DATA=/media/mpiu/a93b0b36-e288-45ef-b21f-acc26e4b0af9/tooth-16S/20201026-trial2
TYPE=fastq
PROC=28
#batch commands
set.logfile(name=$LOGNAME)
make.file(inputdir=$DATA, type=$TYPE, prefix=stability)
make.contigs(file=current, processors=$PROC)
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=500)
unique.seqs()
count.seqs(name=current, group=current)
align.seqs(fasta=current, reference=$REFERENCE_LOCATION/$ALIGNREF)
# screen.seqs(fasta=current, count=current, start=6000, end=26000, maxhomop=8)
screen.seqs(fasta=current, count=current, start=6388, end=25316, maxhomop=8)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
pre.cluster(fasta=current, count=current, diffs=2)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
classify.seqs(fasta=current, count=current, reference=$REFERENCE_LOCATION/$TAXONREF_FASTA, taxonomy=$REFERENCE_LOCATION/$TAXONREF_TAX, cutoff=80)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=$CONTAMINENTS)
remove.groups(count=current, fasta=current, taxonomy=current, groups=Mock)
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)
make.shared(list=current, count=current, label=0.03)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)
phylotype(taxonomy=current)
make.shared(list=current, count=current, label=1)
classify.otu(list=current, count=current, taxonomy=current, label=1)

Clustering /media/mpiu/a93b0b36-e288-45ef-b21f-acc26e4b0af9/tooth-16S/20201026-trial2/stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta.8.dist

tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
1.14933e+08	2.70999e+06	1.24432e+06	2.34844e+06	0.979976	0.685326	0.989289	0.535738	0.989289	0.970366	0.591017	0.984611	


tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
1.52984e+08	5.97733e+08	1.02162e+07	3.72478e+07	0.804198	0.983196	0.937401	0.94134	0.937401	0.940535	0.831814	0.865705	


Clustering /media/mpiu/a93b0b36-e288-45ef-b21f-acc26e4b0af9/tooth-16S/20201026-trial2/stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta.9.dist

Clustering /media/mpiu/a93b0b36-e288-45ef-b21f-acc26e4b0af9/tooth-16S/20201026-trial2/stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta.15.dist

Clustering /media/mpiu/a93b0b36-e288-45ef-b21f-acc26e4b0af9/tooth-16S/20201026-trial2/stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta.17.dist

tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
2.65246e+08	1.57723e+08	8.99204e+06	1.41069e+08	0.652808	0.946063	0.967211	0.527869	0.967211	0.738127	0.544508	0.779501	

Killed

pschloss · January 14, 2021, 5:23pm

Hi there,

Your distance matrix is likely gigantic and is crashing your computer because it’s trying to use too much RAM. A couple of things to consider…

In cluster.split, use cutoff=0.03 and possibly taxlevel=5 or taxlevel=6
In pre.cluster you might use diffs=3 or diffs=4
You might want to check out this blogpost: Why do I have such a large distance matrix
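
Applied to the batch file posted above, the first two suggestions might look roughly like the sketch below. The specific values (diffs=3, taxlevel=5) are just one pick from the suggested ranges, not tuned for this dataset, and every other line of the batch file is assumed to stay as posted. Since pre.cluster runs before the chimera, classification, and removal steps, changing diffs means rerunning from that step onward.

```
# Sketch only: diffs=3 and taxlevel=5 are assumed picks from the suggested ranges.
pre.cluster(fasta=current, count=current, diffs=3)
# ... chimera.vsearch through remove.groups as in the original batch file ...
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=5, cutoff=0.03)
```

With cutoff=0.03, the later make.shared and classify.otu calls that use label=0.03 should not need to change.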

Your distance matrix is so large because your reads do not fully overlap across the V3-V4 region. Without full overlap, denoising is suboptimal and effectively nearly every sequence has an error in it, which inflates the number of unique sequences.

Pat

jrhaulung · January 14, 2021, 11:39pm

Dear Dr. Pschloss
Thank you so much for the detailed explanation.

Sincerely,

Jrhau

