Too many unique sequences before cluster.seqs

Dear Mothur Community,

I’m pretty new to mothur and I’m running 48 samples collected from lake sediments and the water column (Illumina NovaSeq 2×250 bp with Arch519_Bac785 primers). I have successfully finished a demo run (2 samples), but with the full dataset I end up with too many unique sequences (about 9.67 million) after pre.cluster and chimera removal. All of my commands and the relevant log output are below:

make.contigs(file=16S.paired.files, processors=8)
summary.seqs(fasta=current)
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=314)
summary.seqs(fasta=current)
unique.seqs(fasta=current)
count.seqs(name=current, group=current)
summary.seqs(count=current)
align.seqs(fasta=current, template=silva.nr_v138.align)
summary.seqs(fasta=current)
screen.seqs(fasta=current, count=current, summary=current, start=13129, end=25316, maxhomop=8)
summary.seqs(fasta=current, count=current)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current)
pre.cluster(fasta=current, count=current, diffs=2)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
summary.seqs(fasta=current, count=current)

Here are some of the summary.seqs results showing the unique sequence counts:
after make.contigs:

mothur > summary.seqs(fasta=current)
Using 16S.paired.trim.contigs.fasta as input file for the fasta parameter.

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	286	286	0	4	2502656
25%-tile:	1	287	287	0	4	25026556
Median: 	1	287	287	0	4	50053112
75%-tile:	1	287	287	0	5	75079667
97.5%-tile:	1	288	288	2	6	97603567
Maximum:	1	500	500	103	250	100106222
Mean:	1	288	288	0	6
# of Seqs:	100106222

after unique.seqs:

mothur > unique.seqs(fasta=current)
Using 16S.paired.trim.contigs.good.fasta as input file for the fasta parameter.
93763103	23292848

Output File Names: 
16S.paired.trim.contigs.good.names
16S.paired.trim.contigs.good.unique.fasta


mothur > count.seqs(name=current, group=current)
Using 16S.paired.contigs.good.groups as input file for the group parameter.
Using 16S.paired.trim.contigs.good.names as input file for the name parameter.

It took 1128 secs to create a table for 93763103 sequences.

Total number of sequences: 93763103

Output File Names: 
16S.paired.trim.contigs.good.count_table


mothur > summary.seqs(count=current)
Using 16S.paired.trim.contigs.good.count_table as input file for the count parameter.
Using 16S.paired.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	286	286	0	4	2344078
25%-tile:	1	287	287	0	4	23440776
Median: 	1	287	287	0	4	46881552
75%-tile:	1	287	287	0	5	70322328
97.5%-tile:	1	288	288	0	6	91419026
Maximum:	1	314	314	0	189	93763103
Mean:	1	286	286	0	4
# of unique seqs:	23292848
total # of seqs:	93763103

It took 414 secs to summarize 93763103 sequences.

after align.seqs:

mothur > summary.seqs(fasta=16S.paired.trim.contigs.good.unique.align, count=16S.paired.trim.contigs.good.count_table)

Using 72 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	0	0	0	0	1	1
2.5%-tile:	13129	25316	286	0	4	2344078
25%-tile:	13129	25316	287	0	4	23440776
Median: 	13129	25316	287	0	4	46881552
75%-tile:	13129	25316	287	0	5	70322328
97.5%-tile:	13129	25316	288	0	6	91419026
Maximum:	43116	43116	314	0	19	93763103
Mean:	13128	25293	286	0	4
# of unique seqs:	23292848
total # of seqs:	93763103

It took 3906 secs to summarize 93763103 sequences.

after screen.seqs(fasta=current, count=current, summary=current, start=13129, end=25316, maxhomop=8):
mothur > summary.seqs(fasta=current, count=current)
Using 16S.paired.trim.contigs.good.good.count_table as input file for the count parameter.
Using 16S.paired.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 72 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	10241	25316	259	0	3	1
2.5%-tile:	13129	25316	286	0	4	2307844
25%-tile:	13129	25316	287	0	4	23078431
Median: 	13129	25316	287	0	4	46156861
75%-tile:	13129	25316	287	0	5	69235291
97.5%-tile:	13129	25316	288	0	6	90005878
Maximum:	13129	26169	314	0	8	92313721
Mean:	13128	25316	286	0	4
# of unique seqs:	22773596
total # of seqs:	92313721

after filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current):
mothur > unique.seqs(fasta=16S.paired.trim.contigs.good.unique.good.filter.fasta, count=16S.paired.trim.contigs.good.good.count_table)
22773596	22757423

after pre.cluster(fasta=current, count=current, diffs=2)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current):
mothur > summary.seqs(fasta=16S.paired.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=16S.paired.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table)

Using 72 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	928	253	0	3	1
2.5%-tile:	1	933	286	0	4	2307844
25%-tile:	1	933	287	0	4	23078431
Median: 	1	933	287	0	4	46156861
75%-tile:	1	933	287	0	5	69235291
97.5%-tile:	1	933	288	0	6	90005878
Maximum:	2	933	314	0	8	92313721
Mean:	1	932	287	0	4
# of unique seqs:	9671688
total # of seqs:	92313721

It took 574 secs to summarize 92313721 sequences.
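
To make the trend in the logs easier to see, here is a minimal Python sketch that tabulates the unique and total counts reported above and prints the fraction of reads that remain distinct at each step. The step labels and the `unique_fraction` helper are illustrative names, not mothur output. Only the numbers come from the logs.

```python
# Unique and total sequence counts copied from the unique.seqs and
# summary.seqs output above. The labels are shorthand, not mothur file names.
steps = [
    ("unique.seqs (first pass)", 23292848, 93763103),
    ("screen.seqs after alignment", 22773596, 92313721),
    ("filter.seqs + unique.seqs", 22757423, 92313721),
    ("pre.cluster + chimera removal", 9671688, 92313721),
]


def unique_fraction(unique, total):
    """Fraction of reads that are distinct sequences (between 0 and 1)."""
    return unique / total


for label, unique, total in steps:
    print(f"{label:32s} {unique:>11,d} unique / {total:>11,d} total "
          f"= {unique_fraction(unique, total):.1%}")
```

Roughly one read in four is unique before alignment, and about one in ten is still unique after pre.cluster, which is the symptom the rest of this thread is about.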

I think this many sequences is likely to cause problems for cluster.split. I have read the posts about what can make the distance matrix file so large, but I’m not sure whether I will be able to resequence the samples. Any thoughts or help would be much appreciated. Thanks!
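
For a sense of scale, the number of possible pairwise comparisons among n unique sequences is n(n-1)/2. The sketch below applies that formula to the 9,671,688 unique sequences reported above. It is an upper bound on comparisons, not a prediction of file size, since a distance cutoff discards most pairs and cluster.split divides the problem into smaller pieces. The function name `pairwise_comparisons` is illustrative.

```python
def pairwise_comparisons(n_unique):
    """Number of distinct sequence pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


n = 9671688  # unique sequences reported by the last summary.seqs above
pairs = pairwise_comparisons(n)
print(f"{n:,d} unique sequences -> {pairs:,d} possible pairs (~{pairs:.2e})")
```

That is on the order of 10^13 possible pairs, which is why reducing the number of unique sequences matters more than tuning the clustering step itself.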

Hi there,

I’m not sure what the error profile of NovaSeq looks like relative to MiSeq, but I could imagine it is worse. You seem to be resequencing your primers and perhaps your barcodes. Both would leave you with less overlap between your reads. If the barcodes are still on your sequences, that would inflate the number of unique sequences by a factor of the number of samples in your dataset. You might double-check that those parts of the sequences have been removed, either during the sequencing process (i.e. were the 16S primers used as the sequencing primers?) or by supplying an oligos file to make.contigs via the oligos option.
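
One quick way to check whether the primers are still on the reads is to count how many raw reads begin with the primer sequence. The Python sketch below does this with exact matching of a degenerate (IUPAC) primer. Everything named here is an assumption for illustration: `EXAMPLE_PRIMER` is a placeholder that should be replaced with the primer actually used, the toy reads are made up, and `fastq_sequences` assumes an uncompressed four-line-per-record FASTQ file. Unlike mothur's pdiffs option, no mismatches are tolerated.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex anchored at the read start."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer.upper()))


def fraction_starting_with(reads, primer):
    """Fraction of reads whose first bases match the primer exactly."""
    pattern = primer_regex(primer)
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if pattern.match(read.upper()))
    return hits / len(reads)


def fastq_sequences(path, limit=10000):
    """Yield up to `limit` sequence lines from an uncompressed FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= limit:
                break
            if i % 4 == 1:  # second line of each 4-line record
                yield line.strip()


# PLACEHOLDER primer and made-up reads, for illustration only.
EXAMPLE_PRIMER = "CAGCMGCCGCGGTAA"
toy_reads = [
    "CAGCAGCCGCGGTAATACGGAGG",  # starts with the primer (M = A)
    "CAGCCGCCGCGGTAATACCAGCT",  # starts with the primer (M = C)
    "TACGGAGGGTGCAAGCGTTAATC",  # no primer at the start
]
print(fraction_starting_with(toy_reads, EXAMPLE_PRIMER))
```

On real data one would pass `fastq_sequences("your_R1.fastq")` (a hypothetical file name) in place of `toy_reads`. If most reads start with the primer, the primers were sequenced and need to be trimmed.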

Give these a try and let us know how it goes,
Pat