Hey everyone,
I’ve been having trouble with the cluster.split command and was hoping someone could point me in the right direction. For background, the V4 region was sequenced on a MiSeq. I’ve been running this command for 164 hours with 500G of memory on a high-performance computing core, and it will not complete.

Here’s what really perplexes me. I’m running the command on only a subset of the MiSeq run (about 1/3 of the sequence data), because this data is for a separate research project from the other 2/3 of the run. I ran the same protocol on the other 2/3 of the data and everything worked fine (using about 250G of memory and maybe 100 hours, if my memory serves me).

There is only one difference I can think of between the two data sets. The larger (successful) data set was generated by targeting the V4 region directly in pure DNA extract (wetland soil). For the smaller (failing) data set, the full 16S gene was first amplified by PCR (lake water samples), and the V4 region of those amplicons was then targeted with sequencing primers. Could this contribute to the problem I’m experiencing? What I’m really trying to determine is whether this is a command issue or a data quality issue.
Below, I’ve pasted the part of the mothur log where the cluster.split command is failing. The command seems to run fine until it reaches a certain temp file, then it never moves past the cutoff selection stage and sits there indefinitely. I’ve tried several different clustering settings, such as taxlevel=4 or 5 and splitmethod=fasta or classify. The settings below get the furthest. Below the failed log, I’ve included the summary.seqs output for both the successful run and the failed run, generated just before running cluster.split. Any insight would be greatly appreciated.
Regards,
Dean
Using mothur v 1.35.1
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=fasta, taxlevel=5, cutoff=0.15, processors=16)
…
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.4.dist
Cutoff was 0.155 changed cutoff to 0.08
Cutoff was 0.155 changed cutoff to 0.08
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.2.dist
Cutoff was 0.155 changed cutoff to 0.07
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.13.dist
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.16.dist
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.1.dist
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.3.dist
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.8.dist
********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||||
Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.7.dist
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
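In case it helps anyone looking at a log like this, here is a minimal Python sketch for listing which split .dist files show up in the log and how many "Cutoff was ..." lines follow each one. The function name `splits_in_log` is my own, and it assumes each cutoff line belongs to the most recent "Clustering" line, which may not hold when several processors write to the log at once.

```python
import re

# Matches lines like "Clustering <name>.fasta.7.dist" and captures the split number.
CLUSTER_RE = re.compile(r"^Clustering\s+\S*?\.(\d+)\.dist\s*$")

def splits_in_log(log_text):
    """Return [(split_number, cutoff_change_count), ...] in log order.

    Assumption: a "Cutoff was ..." line is attributed to the nearest
    preceding "Clustering" line. With multiple processors the log lines
    can interleave, so treat the counts as a rough guide only.
    """
    splits = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = CLUSTER_RE.match(line)
        if match:
            splits.append([int(match.group(1)), 0])
        elif line.startswith("Cutoff was") and splits:
            splits[-1][1] += 1
    return [tuple(s) for s in splits]

if __name__ == "__main__":
    # Hypothetical file name; point this at the real mothur logfile instead.
    with open("mothur.logfile") as handle:
        for number, changes in splits_in_log(handle.read()):
            print(f"split {number}: {changes} cutoff change line(s)")
```

The last entry printed is the split where the log stops, which in the excerpt above would be the .7.dist file.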
Summaries:
FAILED RUN
mothur > summary.seqs(fasta=REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta, count=REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, processors=8)
Using 8 processors.
             Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     1        512      251      0       3        1
2.5%-tile:   5        512      253      0       4        57275
25%-tile:    5        512      253      0       4        572748
Median:      5        512      253      0       4        1145495
75%-tile:    5        512      253      0       5        1718242
97.5%-tile:  5        512      254      0       6        2233714
Maximum:     5        516      254      0       8        2290988
Mean:        4.99912  512.001  253.008  0       4.65046
# of unique seqs: 413684
total # of seqs: 2290988
Output File Names:
REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.summary
It took 7 secs to summarize 2290988 sequences.
SUCCESSFUL RUN
mothur > summary.seqs(fasta=GREATLAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta, count=GREATLAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)
Using 8 processors.
             Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     7        581      251      0       3        1
2.5%-tile:   12       581      253      0       3        183965
25%-tile:    12       581      253      0       4        1839641
Median:      12       581      253      0       4        3679282
75%-tile:    12       581      253      0       5        5518922
97.5%-tile:  12       581      254      0       6        7174598
Maximum:     12       595      254      0       8        7358562
Mean:        11.9994  581.002  253.043  0       4.59504
# of unique seqs: 879423
total # of seqs: 7358562
Output File Names:
GREATLAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.summary
It took 17 secs to summarize 7358562 sequences.
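For comparison, here is a quick Python sketch of the arithmetic on the two summaries above. It uses only the unique and total sequence counts reported by summary.seqs. The helper names are my own, and `max_pairs` is just the n(n-1)/2 upper bound on pairwise comparisons for an unsplit data set, not what cluster.split actually computes after splitting.

```python
def unique_fraction(unique_seqs, total_seqs):
    """Fraction of all sequences that are unique."""
    return unique_seqs / total_seqs

def max_pairs(n):
    """Upper bound on pairwise comparisons among n sequences (no splitting)."""
    return n * (n - 1) // 2

# Counts copied from the summary.seqs output above.
runs = {
    "REULAKES (failed)": (413684, 2290988),
    "GREATLAKES (successful)": (879423, 7358562),
}

for name, (unique_seqs, total_seqs) in runs.items():
    frac = unique_fraction(unique_seqs, total_seqs)
    print(f"{name}: {frac:.1%} unique, up to {max_pairs(unique_seqs):.3e} pairs")
```

By this measure the failing data set has proportionally more unique sequences (roughly 18% of reads versus roughly 12% for the successful run), even though it has fewer unique sequences in absolute terms. I don't know whether that is related to the stall.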