mothur does not complete batch script

Hello, all,

I am trying to generate OTUs from soil samples, and I am running into problems that I have not had with previous datasets (mothur user since 2009). A quick overview:
V4 primers (515F+806R)
MiSeq with 600-cycle kit

Problem 1: Clustering hangs repeatedly at “dist.seqs” and “cluster”. The dataset is fairly large, but I am using the Stampede supercomputer at the University of Texas https://portal.tacc.utexas.edu/user-guides/stampede through the XSEDE program, so raw computational power should not be an issue. I saw another post indicating that datasets can be too large to cluster, but I am dealing with a much smaller number of unique sequences (~300K).

Problem 2: I was able to get past “dist.seqs” and “cluster” only once. When I did, OTU labels were generated only at the “unique” level. I read other posts in which this problem arose when the Kozich et al. MiSeq SOP settings were not followed. However, I am following those recommendations, and the data seem to align well.

I posted a truncated version of a typical logfile. Output was trimmed to what I think is most important.

I read all posts on the mothur forum that I could find before posting this question. Any ideas would be greatly appreciated.

Jim

Linux version
Using ReadLine
Running 64Bit Version
mothur v.1.37.4
Last updated: 5/11/2016

mothur > make.contigs(file=soil.files, processors=16)
 It took 2107 secs to process 12154789 sequences.
 Total of all groups is 12154789

mothur > summary.seqs(fasta=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 244 244 0 3 1
 2.5%-tile: 1 252 252 0 4 303870
 25%-tile: 1 253 253 0 4 3038698
 Median:  1 253 253 1 5 6077395
 75%-tile: 1 253 253 1 5 9116092
 97.5%-tile: 1 330 330 14 8 11850920
 Maximum: 1 502 502 131 251 12154789
 Mean: 1 256.605 256.605 1.7951 4.8241
 # of Seqs: 12154789

mothur > screen.seqs(fasta=current, group=current, maxambig=0, maxlength=275)
 It took 412 secs to screen 12154789 sequences.

mothur > summary.seqs(fasta=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 249 249 0 3 1
 2.5%-tile: 1 253 253 0 4 143968
 25%-tile: 1 253 253 0 4 1439676
 Median:  1 253 253 0 5 2879352
 75%-tile: 1 253 253 0 5 4319027
 97.5%-tile: 1 254 254 0 8 5614735
 Maximum: 1 275 275 0 14 5758702
 Mean: 1 253.16 253.16 0 4.82025
 # of Seqs: 5758702

mothur > unique.seqs(fasta=current)
 5758702 1073811

mothur > count.seqs(name=current, group=current)
 It took 73 secs to create a table for 5758702 sequences.
 Total number of sequences: 5758702

mothur > summary.seqs(count=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 249 249 0 3 1
 2.5%-tile: 1 253 253 0 4 143968
 25%-tile: 1 253 253 0 4 1439676
 Median:  1 253 253 0 5 2879352
 75%-tile: 1 253 253 0 5 4319027
 97.5%-tile: 1 254 254 0 8 5614735
 Maximum: 1 275 275 0 14 5758702
 Mean: 1 253.16 253.16 0 4.82025
 # of unique seqs: 1073811
 total # of seqs: 5758702

mothur > pcr.seqs(fasta=silva.nr_v123.align, start=11000, end=26000, keepdots=F, processors=16)
 It took 67 secs to screen 172418 sequences.

mothur > system(mv silva.nr_v123.pcr.align silva.v4.align)

mothur > summary.seqs(fasta=silva.v4.align)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 796 10770 232 0 3 1
 2.5%-tile: 885 14517 316 0 4 4311
 25%-tile: 885 14517 317 0 4 43105
 Median:  885 14517 317 0 5 86210
 75%-tile: 885 14517 317 0 5 129314
 97.5%-tile: 885 14517 484 1 6 168108
 Maximum: 5119 14533 1146 5 16 172418
 Mean: 885.127 14516.9 332.899 0.0533703 4.75004
 # of Seqs: 172418

mothur > align.seqs(fasta=soil.trim.contigs.good.unique.fasta, reference=silva.v4.align)
 Reading in the silva.v4.align template sequences... DONE.
 It took 80 secs to read  172418 sequences.
 It took 1799 secs to align 1073811 sequences.

mothur > summary.seqs(fasta=current, count=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 0 0 0 0 1 1
 2.5%-tile: 2862 12444 253 0 4 143968
 25%-tile: 2862 12444 253 0 4 1439676
 Median:  2862 12444 253 0 5 2879352
 75%-tile: 2862 12444 253 0 5 4319027
 97.5%-tile: 2862 12444 254 0 8 5614735
 Maximum: 14517 14517 275 0 14 5758702
 Mean: 2879.83 12441.8 252.593 0 4.81339
 # of unique seqs: 1073811
 total # of seqs: 5758702

mothur > screen.seqs(fasta=current, count=current, summary=current, start=2862, end=12444, maxhomop=8)
 It took 413 secs to screen 1073811 sequences.

mothur > summary.seqs(fasta=current, count=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 885 12444 237 0 3 1
 2.5%-tile: 2862 12444 253 0 4 142629
 25%-tile: 2862 12444 253 0 4 1426285
 Median:  2862 12444 253 0 5 2852569
 75%-tile: 2862 12444 253 0 5 4278853
 97.5%-tile: 2862 12444 254 0 8 5562509
 Maximum: 2862 14517 275 0 8 5705137
 Mean: 2861.63 12444.7 253.144 0 4.81663
 # of unique seqs: 1037708
 total # of seqs: 5705137

mothur > filter.seqs(fasta=current, vertical=T, trump=.)
 Length of filtered alignment: 748
 Number of columns removed: 14252
 Length of the original alignment: 15000
 Number of sequences used to construct filter: 1037708

mothur > unique.seqs(fasta=current, count=current)
 1037708 1036740

mothur > summary.seqs(fasta=current, count=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 747 217 0 3 1
 2.5%-tile: 1 748 253 0 4 142629
 25%-tile: 1 748 253 0 4 1426285
 Median:  1 748 253 0 5 2852569
 75%-tile: 1 748 253 0 5 4278853
 97.5%-tile: 1 748 254 0 8 5562509
 Maximum: 4 748 275 0 8 5705137
 Mean: 1 748 253.132 0 4.81659
 # of unique seqs: 1036740
 total # of seqs: 5705137

mothur > pre.cluster(fasta=current, count=current, diffs=2)
 It took 808 secs to run pre.cluster.

mothur > chimera.uchime(fasta=current, count=current, dereplicate=t)

mothur > remove.seqs(fasta=current, accnos=current)
 Removed 38854 sequences from your fasta file.

mothur > summary.seqs(fasta=current, count=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 747 217 0 3 1
 2.5%-tile: 1 748 253 0 4 140938
 25%-tile: 1 748 253 0 4 1409378
 Median:  1 748 253 0 5 2818756
 75%-tile: 1 748 253 0 5 4228133
 97.5%-tile: 1 748 254 0 8 5496573
 Maximum: 4 748 275 0 8 5637510
 Mean: 1 748 253.134 0 4.81448
 # of unique seqs: 299838
 total # of seqs: 5637510

mothur > classify.seqs(fasta=current, count=current, reference=silva.v4.align, taxonomy=silva.nr_v123.tax, cutoff=80)
 It took 86 seconds generate search database. 
 It took 214 seconds get probabilities. 
 It took 526 secs to classify 299838 sequences.
 It took 37 secs to create the summary file for 299838 sequences.

mothur > remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Mitochondria-Eukaryota)
 [NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.

mothur > summary.seqs(fasta=current, count=current)
   Start End NBases Ambigs Polymer NumSeqs
 Minimum: 1 747 217 0 3 1
 2.5%-tile: 1 748 253 0 4 140918
 25%-tile: 1 748 253 0 4 1409178
 Median:  1 748 253 0 5 2818355
 75%-tile: 1 748 253 0 5 4227532
 97.5%-tile: 1 748 254 0 8 5495791
 Maximum: 4 748 275 0 8 5636708
 Mean: 1 748 253.134 0 4.81407
 # of unique seqs: 299744
 total # of seqs: 5636708

mothur > dist.seqs(fasta=current, cutoff=0.15)
 It took 28536 seconds to calculate the distances for 299744 sequences.

mothur > cluster(column=current, count=current)
 ********************#****#****#****#****#****#****#****#****#****#****#
 Reading matrix:     ||||||||| (progress bar of 309 pipe characters truncated)

How big is your dist file? I’ve never been able to use dist.seqs on soil samples; even 454 data is too diverse. I believe cluster has to load the entire dist file into memory, so your issue isn’t going to be computing power so much as available RAM. Try cluster.split instead. You could start with taxlevel=4, but then check your largest split dist file and see if it’s still too large. I’ve been doing taxlevel=6 on my server.
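A minimal sketch of that suggestion, picking up from the commands in the logfile above (the processors value just mirrors the earlier commands and is not required):

```
mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, processors=16)
```

cluster.split partitions the sequences by their classification at the given taxlevel and clusters each partition separately, so no single distance matrix has to fit in RAM at once.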

If you’re trying to use 300PE sequence data, then you should be reading this and asking for 250PE data instead…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/

kmitchell: Thank you for your suggestion. The cluster.split command got me through the clustering step and did so very quickly.

Hi, Pat:

Thank you for responding.

I was able to get through clustering with cluster.split, as kmitchell suggested (thanks, again), using this command:

cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.05)

My shared file was generated, but OTUs were generated only at the 0.01 level. Your responses to previous posts indicated that this can happen when the chosen hypervariable region aligns poorly. However, we used the V4 region, and the vast majority of the seqs that made it to the alignment step aligned cleanly. Here is the summary for the seqs just before the above command:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 747 217 0 3 1
2.5%-tile: 1 748 253 0 4 140918
25%-tile: 1 748 253 0 4 1409178
Median: 1 748 253 0 5 2818355
75%-tile: 1 748 253 0 5 4227532
97.5%-tile: 1 748 254 0 8 5495791
Maximum: 4 748 275 0 8 5636708
Mean: 1 748 253.134 0 4.81407

# of unique seqs: 299744

total # of seqs: 5636708

Do you have a suggestion for clustering at a 0.03 distance? I have used mothur for all of my previous studies but have never run into this problem before.

jim

Your cutoff is too low; change it to 0.15. The cutoff determines which distances are stored in the matrix. It’s not the same thing as making the shared file at 0.05.
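Put concretely, a sketch of the rerun with the larger cutoff, followed by the usual steps to get 0.03 OTUs (the make.shared and classify.otu calls are the standard follow-up from the MiSeq SOP, not something stated above):

```
mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)
mothur > make.shared(list=current, count=current, label=0.03)
mothur > classify.otu(list=current, count=current, taxonomy=current, label=0.03)
```

With distances stored out to 0.15, the clustering can actually form OTUs at 0.03 instead of stopping at the smaller labels.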