Hello, all,
I am trying to generate OTUs from soil samples, and I am running into problems that I have not had with previous datasets (mothur user since 2009). A quick overview:
V4 primers (515F+806R)
MiSeq with 600-cycle kit
Problem 1: Clustering hangs repeatedly at “dist.seqs” and “cluster”. The data set is pretty large, but I am using the Stampede supercomputer at the University of Texas https://portal.tacc.utexas.edu/user-guides/stampede through the XSEDE program. So, computational power should not be an issue. I saw another post that indicated datasets can be too large to cluster, but I am dealing with a much smaller number of unique sequences (~300K).
Problem 2: I was able to get past “dist.seqs” and “cluster” only once. When I did, only “unique” labels were generate for the OTUs. I read other posts in which the problem arose when Kozich and MiSeq protocol settings were not being followed. However, I am following these recommendations. The data seem to align well.
I posted a truncated version of a typical logfile. Output was trimmed to what I think is most important.
I read all posts on the mothur forum that I could find before posting this question. Any ideas would be greatly appreciated.
Jim
Linux version
Using ReadLine
Running 64Bit Version
mothur v.1.37.4
Last updated: 5/11/2016
mothur > make.contigs(file=soil.files, processors=16)
It took 2107 secs to process 12154789 sequences.
Total of all groups is 12154789
mothur > summary.seqs(fasta=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 244 244 0 3 1
2.5%-tile: 1 252 252 0 4 303870
25%-tile: 1 253 253 0 4 3038698
Median: 1 253 253 1 5 6077395
75%-tile: 1 253 253 1 5 9116092
97.5%-tile: 1 330 330 14 8 11850920
Maximum: 1 502 502 131 251 12154789
Mean: 1 256.605 256.605 1.7951 4.8241
# of Seqs: 12154789
mothur > screen.seqs(fasta=current, group=current, maxambig=0, maxlength=275)
It took 412 secs to screen 12154789 sequences.
mothur > summary.seqs(fasta=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 249 249 0 3 1
2.5%-tile: 1 253 253 0 4 143968
25%-tile: 1 253 253 0 4 1439676
Median: 1 253 253 0 5 2879352
75%-tile: 1 253 253 0 5 4319027
97.5%-tile: 1 254 254 0 8 5614735
Maximum: 1 275 275 0 14 5758702
Mean: 1 253.16 253.16 0 4.82025
# of Seqs: 5758702
mothur > unique.seqs(fasta=current)
5758702 1073811
mothur > count.seqs(name=current, group=current)
It took 73 secs to create a table for 5758702 sequences.
Total number of sequences: 5758702
mothur > summary.seqs(count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 249 249 0 3 1
2.5%-tile: 1 253 253 0 4 143968
25%-tile: 1 253 253 0 4 1439676
Median: 1 253 253 0 5 2879352
75%-tile: 1 253 253 0 5 4319027
97.5%-tile: 1 254 254 0 8 5614735
Maximum: 1 275 275 0 14 5758702
Mean: 1 253.16 253.16 0 4.82025
# of unique seqs: 1073811
total # of seqs: 5758702
mothur > pcr.seqs(fasta=silva.nr_v123.align, start=11000, end=26000, keepdots=F, processors=16)
It took 67 secs to screen 172418 sequences.
mothur > system(mv silva.nr_v123.pcr.align silva.v4.align)
mothur > summary.seqs(fasta=silva.v4.align)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 796 10770 232 0 3 1
2.5%-tile: 885 14517 316 0 4 4311
25%-tile: 885 14517 317 0 4 43105
Median: 885 14517 317 0 5 86210
75%-tile: 885 14517 317 0 5 129314
97.5%-tile: 885 14517 484 1 6 168108
Maximum: 5119 14533 1146 5 16 172418
Mean: 885.127 14516.9 332.899 0.0533703 4.75004
# of Seqs: 172418
mothur > align.seqs(fasta=soil.trim.contigs.good.unique.fasta, reference=silva.v4.align)
Reading in the silva.v4.align template sequences... DONE.
It took 80 to read 172418 sequences.
It took 1799 secs to align 1073811 sequences.
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 2862 12444 253 0 4 143968
25%-tile: 2862 12444 253 0 4 1439676
Median: 2862 12444 253 0 5 2879352
75%-tile: 2862 12444 253 0 5 4319027
97.5%-tile: 2862 12444 254 0 8 5614735
Maximum: 14517 14517 275 0 14 5758702
Mean: 2879.83 12441.8 252.593 0 4.81339
# of unique seqs: 1073811
total # of seqs: 5758702
mothur > screen.seqs(fasta=current, count=current, summary=current, start=2862, end=12444, maxhomop=8)
It took 413 secs to screen 1073811 sequences.
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 885 12444 237 0 3 1
2.5%-tile: 2862 12444 253 0 4 142629
25%-tile: 2862 12444 253 0 4 1426285
Median: 2862 12444 253 0 5 2852569
75%-tile: 2862 12444 253 0 5 4278853
97.5%-tile: 2862 12444 254 0 8 5562509
Maximum: 2862 14517 275 0 8 5705137
Mean: 2861.63 12444.7 253.144 0 4.81663
# of unique seqs: 1037708
total # of seqs: 5705137
mothur > filter.seqs(fasta=current, vertical=T, trump=.)
Length of filtered alignment: 748
Number of columns removed: 14252
Length of the original alignment: 15000
Number of sequences used to construct filter: 1037708
mothur > unique.seqs(fasta=current, count=current)
1037708 1036740
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 747 217 0 3 1
2.5%-tile: 1 748 253 0 4 142629
25%-tile: 1 748 253 0 4 1426285
Median: 1 748 253 0 5 2852569
75%-tile: 1 748 253 0 5 4278853
97.5%-tile: 1 748 254 0 8 5562509
Maximum: 4 748 275 0 8 5705137
Mean: 1 748 253.132 0 4.81659
# of unique seqs: 1036740
total # of seqs: 5705137
mothur > pre.cluster(fasta=current, count=current, diffs=2)
It took 808 secs to run pre.cluster.
mothur > chimera.uchime(fasta=current, count=current, dereplicate=t)
mothur > remove.seqs(fasta=current, accnos=current)
Removed 38854 sequences from your fasta file.
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 747 217 0 3 1
2.5%-tile: 1 748 253 0 4 140938
25%-tile: 1 748 253 0 4 1409378
Median: 1 748 253 0 5 2818756
75%-tile: 1 748 253 0 5 4228133
97.5%-tile: 1 748 254 0 8 5496573
Maximum: 4 748 275 0 8 5637510
Mean: 1 748 253.134 0 4.81448
# of unique seqs: 299838
total # of seqs: 5637510
mothur > classify.seqs(fasta=current, count=current, reference=silva.v4.align, taxonomy=silva.nr_v123.tax, cutoff=80)
It took 86 seconds generate search database.
It took 214 seconds get probabilities.
It took 526 secs to classify 299838 sequences.
It took 37 secs to create the summary file for 299838 sequences.
mothur > remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Mitochondria-Eukaryota)
[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.
mothur > summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 747 217 0 3 1
2.5%-tile: 1 748 253 0 4 140918
25%-tile: 1 748 253 0 4 1409178
Median: 1 748 253 0 5 2818355
75%-tile: 1 748 253 0 5 4227532
97.5%-tile: 1 748 254 0 8 5495791
Maximum: 4 748 275 0 8 5636708
Mean: 1 748 253.134 0 4.81407
# of unique seqs: 299744
total # of seqs: 5636708
mothur > dist.seqs(fasta=current, cutoff=0.15)
It took 28536 seconds to calculate the distances for 299744 sequences.
mothur > cluster(column=current, count=current)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: ||||||||| (truncated the 309 pipe characters)