cluster.split V4 MiSeq runtime problem

Hey everyone,

I’ve been having some trouble with the cluster.split command and was hoping someone could point me in the right direction. For background: the V4 region was targeted using MiSeq. I’ve been running this command for 164 hours with 500 GB of memory on a high-performance computing cluster, and it will not complete. Here’s what’s really perplexing me: I’m running this command on just a subset of the sequence data from the MiSeq run (about 1/3 of it), as this data supports a separate research project from the other 2/3 of the run. I’ve run the same protocol on the other 2/3 of the data, and everything works out fine (using about 250 GB of memory and maybe 100 hours, if memory serves).

There is only one difference I can think of between the two data sets. The larger (successful) data set was generated by targeting the V4 region directly in the DNA extract (wetland soil). For the smaller (failing) set, the full 16S gene was targeted in PCR (lake water samples), and the V4 region of the amplicons was then targeted with the sequencing primers. Could this be causing the issues I’m experiencing? What I’m really trying to determine is whether this is a command issue or a data-quality issue.

Below, I’ve pasted output from the mothur log where the cluster.split command is failing. The command seems to run fine until it reaches a certain temp file, then it never moves past the cutoff-selection stage and sits there indefinitely. I’ve tried several different clustering settings, such as taxlevel=4 or 5 and splitmethod=fasta or classify; the command settings below get the furthest. Below the failed-command log, I’ve listed summary output for both the successful run and the failed run prior to running cluster.split. Any insight would be greatly appreciated.

Regards,
Dean

Using mothur v 1.35.1

cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=fasta, taxlevel=5, cutoff=0.15, processors=16)

Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.4.dist
Cutoff was 0.155 changed cutoff to 0.08
Cutoff was 0.155 changed cutoff to 0.08
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.2.dist
Cutoff was 0.155 changed cutoff to 0.07
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.13.dist
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.16.dist
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.1.dist
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.3.dist
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.8.dist
********************###########
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||||


Clustering REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta.7.dist
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06
Cutoff was 0.155 changed cutoff to 0.06


Summaries:

FAILED RUN
mothur > summary.seqs(fasta=REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta, count=REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, processors=8)

Using 8 processors.

            Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1        512      251      0       3        1
2.5%-tile:  5        512      253      0       4        57275
25%-tile:   5        512      253      0       4        572748
Median:     5        512      253      0       4        1145495
75%-tile:   5        512      253      0       5        1718242
97.5%-tile: 5        512      254      0       6        2233714
Maximum:    5        516      254      0       8        2290988
Mean:       4.99912  512.001  253.008  0       4.65046

# of unique seqs: 413684

total # of seqs: 2290988

Output File Names:
REULAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.summary

It took 7 secs to summarize 2290988 sequences.


SUCCESSFUL RUN
mothur > summary.seqs(fasta=GREATLAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.fasta, count=GREATLAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)

Using 8 processors.

            Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    7        581      251      0       3        1
2.5%-tile:  12       581      253      0       3        183965
25%-tile:   12       581      253      0       4        1839641
Median:     12       581      253      0       4        3679282
75%-tile:   12       581      253      0       5        5518922
97.5%-tile: 12       581      254      0       6        7174598
Maximum:    12       595      254      0       8        7358562
Mean:       11.9994  581.002  253.043  0       4.59504

# of unique seqs: 879423

total # of seqs: 7358562

Output File Names:
GREATLAKES.trim.contigs.good.unique.pick.good.filter.unique.precluster.pick.pick.summary

It took 17 secs to summarize 7358562 sequences.

Do you know which chemistry was used to generate the different datasets?

Pat

The sequencing kit used was MiSeq Reagent Kit v2 (500 cycle), catalog # MS-102-2003. The two data sets were generated on the same MiSeq run. All library prep was accomplished at the sequencing facility using the method described by Kozich JJ, et al., Appl Environ Microbiol. 2013 Sep;79(17):5112-20.

Dean

A couple of things you might try…

  1. Use fewer processors on the clustering step. You might have two gigantic groups being clustered at the same time, which kills your RAM.
  2. Wait longer. For big jobs, it’s not uncommon to have to wait a week or two.
  3. Use diffs=3 in pre.cluster. With 250 nt reads, you’ll still be under the 3% threshold.
  4. Use taxlevel=6 in cluster.split. Very few genera, if any, are more than 97% similar to each other.
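As a rough illustration of point 1: the RAM cost of clustering is driven by the pairwise distance matrix of each taxonomic split, and clustering many splits in parallel multiplies that cost. Here is a back-of-the-envelope sketch — the fraction of pairs stored and the bytes per pair are assumptions for illustration, not mothur internals:

```python
def dist_matrix_ram_gb(n_uniques, fraction_stored=0.01, bytes_per_pair=16):
    """Rough RAM estimate for one split's sparse distance matrix.

    n_uniques       -- unique sequences in the split
    fraction_stored -- share of all pairs falling under the cutoff that
                       must be kept (assumption; heavily data-dependent)
    bytes_per_pair  -- two sequence indices plus a distance (assumption)
    """
    total_pairs = n_uniques * (n_uniques - 1) / 2
    return total_pairs * fraction_stored * bytes_per_pair / 1e9

# One split with 200,000 unique sequences under these assumptions:
print(round(dist_matrix_ram_gb(200000), 1))  # ~3.2 GB
```

With 16 processors, up to 16 such matrices can be in memory at once, which is why dropping the processor count can rescue a run that would otherwise exhaust RAM.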

Pat

Thanks Pat, I’ll give your recommendations a shot; I hadn’t thought of approaching it from those angles. Fingers crossed! I’ll let you know how it goes.

Thanks again,
Dean

Hi Pat,

It seems your advice worked! I used 8 processors, set diffs=3 during the pre.cluster step, and set taxlevel=6 and splitmethod=classify during clustering. Clustering completed in about 90 hours using this method for my particular data set. Hopefully this thread will help anyone having a similar issue in the future.
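For anyone trying to reproduce this, the adjusted steps translate to roughly the following mothur commands (a sketch using mothur’s “current” file shortcuts; substitute your own file names and check the parameters against the mothur wiki):

```
mothur > pre.cluster(fasta=current, count=current, diffs=3)
mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=6, cutoff=0.15, processors=8)
```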

Thanks again,
Dean

Hi!
I also have a time-related issue with the cluster.split command. As happened to Dean, the run is taking a considerable number of hours (about 24 hours in, and still only halfway through). The first step (calculating the distance matrices) had already completed, and the files were being read. The problem is that I inevitably had to put the computer to sleep in order to transport it (I did not shut it down), and now the analysis seems to have been killed, although it didn’t give any error message. I had put the computer to sleep (and thus paused the analysis) during previous steps, and the analysis always continued after I restored the session.
Does anyone know the possible effects of pausing an analysis mid-command? Is there any risk of the analysis going wrong even though you don’t get an error or warning message?

Thanks in advance.

It could be a few things, but I suspect that you may have run out of RAM on your laptop. I’d suggest finding a computer that you can leave active for an extended period of time and checking how much RAM it has. Also, if you follow up, please create a new thread so that we can keep things organized on the forum.

Pat