Dear Mothur Community,
I’m fairly new to mothur and I’m processing 48 samples collected from lake sediments and the water column (Illumina NovaSeq 2×250 bp, Arch519/Bac785 primers). A demo run with 2 samples finished successfully, but on the full dataset I’ve run into trouble: far too many unique sequences (9.67 million) remain after pre.cluster. Here are all my commands and the relevant log output:
make.contigs(file=16S.paired.files, processors=8)
summary.seqs(fasta=current)
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=314)
summary.seqs(fasta=current)
unique.seqs(fasta=current)
count.seqs(name=current, group=current)
summary.seqs(count=current)
align.seqs(fasta=current, template=silva.nr_v138.align)
summary.seqs(fasta=current)
screen.seqs(fasta=current, count=current, summary=current, start=13129, end=25316, maxhomop=8)
summary.seqs(fasta=current, count=current)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current)
pre.cluster(fasta=current, count=current, diffs=2)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
summary.seqs(fasta=current, count=current)
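For reference, these are the downstream steps I was planning to run next, adapted from the MiSeq SOP (the SILVA reference/taxonomy file names are just the ones I have locally, so treat them as placeholders):

```
classify.seqs(fasta=current, count=current, reference=silva.nr_v138.align, taxonomy=silva.nr_v138.tax)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)
cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=4, cutoff=0.03)
```

My worry is specifically about the cluster.split step at the end.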
Here are the summary.seqs results showing the unique-sequence counts at each stage:
after make.contigs:
mothur > summary.seqs(fasta=current)
Using 16S.paired.trim.contigs.fasta as input file for the fasta parameter.
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 286 286 0 4 2502656
25%-tile: 1 287 287 0 4 25026556
Median: 1 287 287 0 4 50053112
75%-tile: 1 287 287 0 5 75079667
97.5%-tile: 1 288 288 2 6 97603567
Maximum: 1 500 500 103 250 100106222
Mean: 1 288 288 0 6
# of Seqs: 100106222
after unique.seqs:
mothur > unique.seqs(fasta=current)
Using 16S.paired.trim.contigs.good.fasta as input file for the fasta parameter.
93763103 23292848
Output File Names:
16S.paired.trim.contigs.good.names
16S.paired.trim.contigs.good.unique.fasta
mothur > count.seqs(name=current, group=current)
Using 16S.paired.contigs.good.groups as input file for the group parameter.
Using 16S.paired.trim.contigs.good.names as input file for the name parameter.
It took 1128 secs to create a table for 93763103 sequences.
Total number of sequences: 93763103
Output File Names:
16S.paired.trim.contigs.good.count_table
mothur > summary.seqs(count=current)
Using 16S.paired.trim.contigs.good.count_table as input file for the count parameter.
Using 16S.paired.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 286 286 0 4 2344078
25%-tile: 1 287 287 0 4 23440776
Median: 1 287 287 0 4 46881552
75%-tile: 1 287 287 0 5 70322328
97.5%-tile: 1 288 288 0 6 91419026
Maximum: 1 314 314 0 189 93763103
Mean: 1 286 286 0 4
# of unique seqs: 23292848
total # of seqs: 93763103
It took 414 secs to summarize 93763103 sequences.
after align.seqs:
mothur > summary.seqs(fasta=16S.paired.trim.contigs.good.unique.align, count=16S.paired.trim.contigs.good.count_table)
Using 72 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 13129 25316 286 0 4 2344078
25%-tile: 13129 25316 287 0 4 23440776
Median: 13129 25316 287 0 4 46881552
75%-tile: 13129 25316 287 0 5 70322328
97.5%-tile: 13129 25316 288 0 6 91419026
Maximum: 43116 43116 314 0 19 93763103
Mean: 13128 25293 286 0 4
# of unique seqs: 23292848
total # of seqs: 93763103
It took 3906 secs to summarize 93763103 sequences.
after screen.seqs(fasta=current, count=current, summary=current, start=13129, end=25316, maxhomop=8):
mothur > summary.seqs(fasta=current, count=current)
Using 16S.paired.trim.contigs.good.good.count_table as input file for the count parameter.
Using 16S.paired.trim.contigs.good.unique.good.align as input file for the fasta parameter.
Using 72 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 10241 25316 259 0 3 1
2.5%-tile: 13129 25316 286 0 4 2307844
25%-tile: 13129 25316 287 0 4 23078431
Median: 13129 25316 287 0 4 46156861
75%-tile: 13129 25316 287 0 5 69235291
97.5%-tile: 13129 25316 288 0 6 90005878
Maximum: 13129 26169 314 0 8 92313721
Mean: 13128 25316 286 0 4
# of unique seqs: 22773596
total # of seqs: 92313721
after filter.seqs(fasta=current, vertical=T, trump=.) and unique.seqs(fasta=current, count=current):
mothur > unique.seqs(fasta=16S.paired.trim.contigs.good.unique.good.filter.fasta, count=16S.paired.trim.contigs.good.good.count_table)
22773596 22757423
after pre.cluster(fasta=current, count=current, diffs=2), chimera.vsearch(fasta=current, count=current, dereplicate=t), and remove.seqs(fasta=current, accnos=current):
mothur > summary.seqs(fasta=16S.paired.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=16S.paired.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table)
Using 72 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 928 253 0 3 1
2.5%-tile: 1 933 286 0 4 2307844
25%-tile: 1 933 287 0 4 23078431
Median: 1 933 287 0 4 46156861
75%-tile: 1 933 287 0 5 69235291
97.5%-tile: 1 933 288 0 6 90005878
Maximum: 2 933 314 0 8 92313721
Mean: 1 932 287 0 4
# of unique seqs: 9671688
total # of seqs: 92313721
It took 574 secs to summarize 92313721 sequences.
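To put the scale in perspective, here is a quick back-of-the-envelope calculation I did (my own sketch, not mothur output) for a full pairwise distance matrix over the ~9.67M uniques left after pre.cluster:

```python
# Back-of-the-envelope: size of a full pairwise distance matrix
# for the ~9.67M unique sequences left after pre.cluster.
n_unique = 9_671_688  # "# of unique seqs" from summary.seqs above

# One distance per unordered pair: n * (n - 1) / 2
n_pairs = n_unique * (n_unique - 1) // 2
print(f"{n_pairs:,} pairwise distances")  # ~4.7e13 pairs

# Even at only a few bytes per stored distance, that is on the
# order of hundreds of terabytes, which is why I expect a plain
# dist.seqs/cluster run to be out of reach without splitting.
print(f"~{n_pairs * 4 / 1e12:.0f} TB at 4 bytes per distance")
```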
I think this many unique sequences is likely to cause problems for the cluster.split step… I have read the posts about what can produce a very large distance matrix, but I’m not sure I’ll be able to resequence the samples… Any thoughts or help would be much appreciated! Thanks.