Generating ASVs that all have a size of one

I am having an issue generating ASVs. I set pre.cluster to allow 4 differences (1 for every 100 bp). When generating the ASV files, I end up with a shared file and a taxonomy file containing about 3.6 million ASVs, each with a size of one. This makes me believe there is an issue with pre.cluster, because no sequences are being grouped. I am not sure why this is happening and would appreciate any help. Do I need to set the number of differences allowed to a higher value?

mothur > pre.cluster(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.fasta ,
   count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/staility.contigs.good.good.count_table , diffs = 4)
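A minimal sketch of the "1 difference for every 100 bp" rule of thumb mentioned above, assuming the value is simply rounded down. The function name diffs_for_length is hypothetical and is not part of mothur; the read lengths are the minimum, median, and maximum NBases reported in the summaries in this thread.

```python
def diffs_for_length(read_length_bp: int, per_bp: int = 100) -> int:
    """Allow one difference per `per_bp` bases, rounded down (assumed rule)."""
    if read_length_bp < 0 or per_bp <= 0:
        raise ValueError("read length must be >= 0 and per_bp must be > 0")
    return read_length_bp // per_bp


# Read lengths taken from the summary.seqs reports in this thread.
for length in (421, 445, 466):
    print(length, "bp ->", diffs_for_length(length), "diffs")
```

Under that assumption, reads across the whole reported length range map to the same value that was passed as diffs.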

After pre.cluster, I followed the SOP for chimera checking, classifying, and removing lineage.

mothur > make.shared(count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.count_table )

mothur > classify.otu(list = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinfomatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.asv.list , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.count_table , taxonomy = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.rdp.wang.pick.taxonomy , label = ASV)

Are you sure you don’t get an error message when running pre.cluster? Your count_file seems to be missing a “b” in stability.

Can you run summary.seqs with the fasta and count_file generated from pre.cluster? What region are you sequencing?

Also… It’s generally not a great idea to provide absolute paths like you are, and I suspect that if you’re running this off a flash drive, things are moving pretty slowly for you.

Pat

I did not get an error message. I copied that line from the terminal, and it does not always copy over the last word in the line correctly. The path should be correct.

Here is the summary.seqs:

mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.count_table )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1192 421 0 3 1
2.5%-tile: 1 1194 440 0 4 92596
25%-tile: 1 1194 442 0 4 925954
Median: 1 1194 445 0 5 1851907
75%-tile: 1 1194 465 0 6 2777860
97.5%-tile: 1 1194 465 0 6 3611218
Maximum: 2 1194 466 0 8 3703813
Mean: 1 1193 451 0 5
# of unique seqs: 3703813
total # of seqs: 3703813

I should have included that in the original post. Sorry. I am looking at the not-recommended V3-V4 region.

Thank you,
Joe

Thanks - can you post the output of summary.seqs from the input files going into pre.cluster?

Thank you. Here it is:

mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stbility.contigs.good.good.count_table )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1192 421 0 3 1
2.5%-tile: 1 1194 440 0 4 92596
25%-tile: 1 1194 442 0 4 925954
Median: 1 1194 445 0 5 1851907
75%-tile: 1 1194 465 0 6 2777860
97.5%-tile: 1 1194 465 0 6 3611218
Maximum: 2 1194 466 0 8 3703813
Mean: 1 1193 451 0 5
# of unique seqs: 3703813
total # of seqs: 3703813

It appears that you don’t have any duplicate sequences, just a large number of unique sequences. The large number of uniques is not weird given the region you sequenced, but the lack of any duplicates is. Did you run unique.seqs at some point in your workflow? I wonder if you possibly left the count file out of any steps where it should have been included. Can you post each of the commands you have run prior to this step?

Pat
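The "no duplicates" observation can be read off the last two lines of a summary.seqs report: when the unique count equals the total count, every sequence has an abundance of one. Below is a rough sketch of that check. The helper names seq_counts and has_duplicates are hypothetical, and the parsing assumes the report text looks like the output pasted in this thread.

```python
import re


def seq_counts(summary_text: str):
    """Return (unique, total) parsed from pasted summary.seqs output."""
    unique = re.search(r"of unique seqs:\s*(\d+)", summary_text)
    total = re.search(r"total # of seqs:\s*(\d+)", summary_text)
    if unique is None or total is None:
        raise ValueError("could not find the unique/total lines")
    return int(unique.group(1)), int(total.group(1))


def has_duplicates(summary_text: str) -> bool:
    """True when at least one unique sequence represents more than one read."""
    unique, total = seq_counts(summary_text)
    return total > unique


# The two closing lines of the report posted above.
report = """# of unique seqs: 3703813
total # of seqs: 3703813
"""
print(seq_counts(report), has_duplicates(report))
```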

I did not run unique.seqs. My advisor suggested I skip those steps so we could look more closely at abundance. Here are all the commands I ran:

mothur > make.file(inputdir = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics, type=fastq, prefix = stability)

mothur > make.contigs(file = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.files )
mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.contigs.count_tabe )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 35 35 0 2 1
2.5%-tile: 1 35 35 0 4 325544
25%-tile: 1 440 440 0 5 3255437
Median: 1 445 445 18 6 6510874
75%-tile: 1 465 465 21 6 9766311
97.5%-tile: 1 466 466 38 35 12696204
Maximum: 1 602 602 183 301 13021747
Mean: 1 424 424 13 7
# of unique seqs: 13021747
total # of seqs: 13021747

mothur > screen.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.contigs.count_tabl , maxambig = 0, minlength = 200, maxlength = 466, maxhomop = 8)

mothur > summary.seqs(fasta = current, count = current)

Using /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.contigs.good.count_table as input file for the count parameter.

Using /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.good.fasta as input file for the fasta parameter.

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 3 1
2.5%-tile: 1 288 288 0 4 103523
25%-tile: 1 441 441 0 5 1035221
Median: 1 442 442 0 5 2070442
75%-tile: 1 465 465 0 6 3105663
97.5%-tile: 1 465 465 0 6 4037361
Maximum: 1 466 466 0 8 4140883
Mean: 1 436 436 0 5
# of unique seqs: 4140883
total # of seqs: 4140883

mothur > align.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.good.fasta , reference = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/silva.bacteri/silva.bacteria.fasta )

mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.good.align , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.contigs.goodcount_table )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 6388 25316 10 0 3 103523
25%-tile: 6388 25316 441 0 4 1035221
Median: 6388 25316 442 0 5 2070442
75%-tile: 6388 25316 465 0 6 3105663
97.5%-tile: 43061 43116 465 0 6 4037361
Maximum: 43116 43116 466 0 8 4140883
Mean: 9434 26631 411 0 5
# of unique seqs: 4140883
total # of seqs: 4140883

It took 4324 secs to summarize 4140883 sequences.

mothur > screen.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.good.align , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.contigs.good.ount_table , start = 6388, end = 25316)

mothur > summary.seqs(fasta = current, count = current)

Using /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.contigs.good.good.count_table as input file for the count parameter.

Using /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.good.good.align as input file for the fasta parameter.

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 6099 25316 421 0 3 1
2.5%-tile: 6388 25316 440 0 4 92596
25%-tile: 6388 25316 442 0 4 925954
Median: 6388 25316 445 0 5 1851907
75%-tile: 6388 25316 465 0 6 2777860
97.5%-tile: 6388 25316 465 0 6 3611218
Maximum: 6388 26155 466 0 8 3703813
Mean: 6387 25316 451 0 5
# of unique seqs: 3703813
total # of seqs: 3703813

mothur > filter.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/stability.trim.contigs.good.good.align , vertical = T, trump=.)

It took 17646 secs to filter 3703813 sequences.

Length of filtered alignment: 1194

Number of columns removed: 48806

Length of the original alignment: 50000

Number of sequences used to construct filter: 3703813

mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stbility.contigs.good.good.count_table )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1192 421 0 3 1
2.5%-tile: 1 1194 440 0 4 92596
25%-tile: 1 1194 442 0 4 925954
Median: 1 1194 445 0 5 1851907
75%-tile: 1 1194 465 0 6 2777860
97.5%-tile: 1 1194 465 0 6 3611218
Maximum: 2 1194 466 0 8 3703813
Mean: 1 1193 451 0 5
# of unique seqs: 3703813
total # of seqs: 3703813

It took 109 secs to summarize 3703813 sequences.

mothur > pre.cluster(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/staility.contigs.good.good.count_table , diffs = 4)

mothur > chimera.vsearch(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinfomatics/ASV/stability.trim.contigs.good.good.filter.precluster.count_table , dereplicate = T)

mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformaics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.count_table )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1192 421 0 3 1
2.5%-tile: 1 1194 440 0 4 92596
25%-tile: 1 1194 442 0 4 925954
Median: 1 1194 445 0 5 1851907
75%-tile: 1 1194 465 0 6 2777860
97.5%-tile: 1 1194 465 0 6 3611218
Maximum: 2 1194 466 0 8 3703813
Mean: 1 1193 451 0 5
# of unique seqs: 3703813
total # of seqs: 3703813

It took 109 secs to summarize 3703813 sequences.

mothur > classify.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.count_table , reference = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/trainset18_062020.rdp/trainset18_062020.rdp.fasta , taxonomy = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/trainset18_062020.rdp/trainset18_062020.rdp.tax )

mothur > remove.lineage(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinforatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.count_table, taxonomy = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.rdp.wang.taxonomy , taxon = Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

mothur > summary.seqs(fasta = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.pick.fasta , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinormatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.count_table )

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1192 421 0 3 1
2.5%-tile: 1 1194 440 0 4 89648
25%-tile: 1 1194 442 0 4 896473
Median: 1 1194 445 0 5 1792946
75%-tile: 1 1194 465 0 6 2689418
97.5%-tile: 1 1194 465 0 6 3496243
Maximum: 2 1194 466 0 8 3585890
Mean: 1 1193 452 0 5
# of unique seqs: 3585890
total # of seqs: 3585890

mothur > summary.tax(taxonomy = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.rdp.wang.pick.taxonomy , count = /Volumes/FlashDrive/USA/MicrobiomProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.count_table )
It took 103 secs to create the summary file for 3585890 sequences.

mothur > make.shared(count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.count_table )

mothur > classify.otu(list = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinfomatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.asv.list , count = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.denovo.vsearch.pick.count_table , taxonomy = /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/ASV/stability.trim.contigs.good.good.filter.precluster.rdp.wang.pick.taxonomy , label = ASV)

Thanks - can you run unique.seqs right after running make.contigs? It shouldn’t do anything to the abundances. If you could post the output of running summary.seqs after unique.seqs, that would be helpful.

Pat
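To illustrate why dereplicating should not change abundances, here is a small sketch of the general idea: identical reads collapse into one representative, and a count table records how many reads each representative stands for in each sample. This is an assumption-laden toy, not mothur's implementation; the function name dereplicate, the sample labels, and the reads are all made up.

```python
from collections import Counter, defaultdict


def dereplicate(reads):
    """Collapse identical sequences, keeping a per-sample count for each."""
    table = defaultdict(Counter)
    for sample, seq in reads:
        table[seq][sample] += 1
    return table


# Hypothetical (sample, sequence) pairs.
reads = [
    ("S1", "ACGT"),
    ("S1", "ACGT"),
    ("S2", "ACGT"),
    ("S1", "ACGA"),
    ("S2", "TTGA"),
]

table = dereplicate(reads)
n_unique = len(table)
n_total = sum(sum(per_sample.values()) for per_sample in table.values())
print("unique:", n_unique, "total:", n_total)
```

The number of unique sequences drops, while the total number of reads carried by the count table stays equal to the number of input reads.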

Hi Pat,

Here is the summary after running unique.seqs right after make.contigs:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 35 35 0 2 1
2.5%-tile: 1 35 35 0 4 325544
25%-tile: 1 440 440 0 5 3255437
Median: 1 445 445 18 6 6510874
75%-tile: 1 465 465 21 6 9766311
97.5%-tile: 1 466 466 38 35 12696204
Maximum: 1 602 602 183 301 13021747
Mean: 1 424 424 13 7
# of unique seqs: 7774725
total # of seqs: 13021747

Let me know if this helps.

Thank you,
Joe

Well that’s progress - can you take this fasta and count file through the rest of the steps in your pipeline and see what happens?
Pat

Hi Pat,

I went through the rest of the pipeline. There are now 367,139 ASVs, which is the number of unique sequences. It seems like pre.cluster is still not having any impact on the ASVs. I may be wrong, but 367,139 ASVs still seems like an unusually high number. The ASVs no longer all have a size of one, though.
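For readers unfamiliar with what a pre-clustering step is expected to do, the following is a simplified greedy illustration of the general idea: rarer sequences within a given number of differences of a more abundant sequence are folded into it. It is a sketch under stated assumptions and is not mothur's actual pre.cluster code. The names mismatches and greedy_precluster are hypothetical, the sequences are toy data, every differing column counts as one difference, and the pairwise comparison is quadratic, so it would not scale to millions of sequences.

```python
def mismatches(a: str, b: str) -> int:
    """Count differing columns between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def greedy_precluster(counts: dict, diffs: int) -> dict:
    """Fold rarer sequences into more abundant ones within `diffs` mismatches."""
    # Most abundant first; ties broken by sequence so the result is repeatable.
    order = sorted(counts, key=lambda s: (-counts[s], s))
    merged = {}
    absorbed = set()
    for i, seed in enumerate(order):
        if seed in absorbed:
            continue
        merged[seed] = counts[seed]
        for other in order[i + 1:]:
            if other not in absorbed and mismatches(seed, other) <= diffs:
                merged[seed] += counts[other]
                absorbed.add(other)
    return merged


# Toy abundances for four aligned sequences.
toy = {"AAAAAAAAAA": 10, "AAAAAAAAAT": 2, "AAAAAAAATT": 1, "CCCCCCCCCC": 5}
print(greedy_precluster(toy, diffs=1))
print(greedy_precluster(toy, diffs=2))
```

In this toy, raising diffs from 1 to 2 lets the most abundant sequence absorb one more neighbor, and the total abundance is the same before and after merging.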

mothur > summary.seqs(fasta = current, count = current)

Using /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/Unique/stability.trim.contigs.unique.good.good.filter.unique.precluster.denovo.vsearch.pick.count_table as input file for the count parameter.

Using /Volumes/FlashDrive/USA/MicrobiomeProject/Bioinformatics/Unique/stability.trim.contigs.unique.good.good.filter.unique.precluster.denovo.vsearch.pick.fasta as input file for the fasta parameter.

Using 10 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1192 424 0 3 1
2.5%-tile: 1 1194 440 0 4 84526
25%-tile: 1 1194 442 0 6 845258
Median: 1 1194 445 0 6 1690516
75%-tile: 1 1194 465 0 6 2535774
97.5%-tile: 1 1194 465 0 6 3296506
Maximum: 2 1194 466 0 8 3381031
Mean: 1 1193 452 0 5
# of unique seqs: 367139
total # of seqs: 3381031

Hey Joe -

Hmmm, this seems better, but it does still seem a bit weird that you aren’t collapsing more things together. It could be a product of using a region with minimal overlap between the reads. I also wonder whether your barcodes and primers are still on the sequences. If the barcodes were still on, that would artificially make the sequences look different. Can you look at some of the sequences and see if you find the barcodes and primers? Those could be removed in the make.contigs step or by adding a trim.seqs step, both using an oligos file. Otherwise, my only guess would be that there’s a lot of noise in the assembled sequences.

pat
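One way to "look at some of the sequences" is to count how many unaligned reads begin with the forward primer, treating IUPAC degenerate bases as wildcards. The sketch below assumes that approach. The primer string and the reads are hypothetical placeholders, so substitute the primers actually used; the function names are made up; and only the 5' end is checked (a reverse primer would need to be searched for as its reverse complement at the 3' end, which is not shown).

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer: str):
    """Compile a primer with degenerate bases into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_starting_with(seqs, primer: str) -> float:
    """Fraction of sequences whose 5' end matches the primer."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    pattern = primer_regex(primer)
    hits = sum(1 for s in seqs if pattern.match(s.upper()))
    return hits / len(seqs)


# Hypothetical placeholder primer and reads, for illustration only.
forward_primer = "CCTACGGGNGGCWGCAG"
example_reads = [
    "CCTACGGGAGGCAGCAGTGGGGAATATTG",  # starts with the primer
    "TGGGGAATATTGCACAATGGGCGAAAGCC",  # primer already removed
]
print(fraction_starting_with(example_reads, forward_primer))
```

A fraction near 1.0 on a sample of real reads would suggest the primer is still attached; a fraction near 0.0 would suggest it has been trimmed.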

Hi Pat,

I was able to confirm that the barcodes have been removed, but it does look like the primers are still attached.

Best,
Joe