Inflated unique sequence count

I’m re-running an analysis with mothur v.1.44.3 on about 300 samples (Illumina MiSeq 2x250 PE with EMP V4 primers) and getting far more unique sequences than in my first run, which used mothur v.1.40.5 on the same fastq files and computer cluster. Following the SOP, I still have ~400k uniques going into cluster.split(), which produces ~100k OTUs, whereas my first run produced a more reasonable ~9k OTUs. When I now re-run the analysis with the older version, I still get huge shared files, with 360k unique sequences at cluster.split() and 6.8 million total sequences. According to my records my commands are exactly the same, so I’m not sure why this is happening! The problem starts early in the analysis: the first unique.seqs() step produces ~1 million uniques, compared with ~450k on the previous run. Here are my commands:

make.file(inputdir=., type=fastq, prefix=pp)
make.contigs(file=pp.files, processors=32)
screen.seqs(fasta=current, group=current, summary=current, maxambig=0, maxlength=275)
count.seqs(name=current, group=current)
pcr.seqs(fasta=silva.nr_v138.align, start=13862, end=23444, keepdots=F, processors=8)
rename.file(input=silva.nr_v138.pcr.align, new=silva.v4.fasta)
align.seqs(fasta=pp.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, summary=current, start=1967, end=11549, maxhomop=8)
summary.seqs(fasta=current, count=current)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
pre.cluster(count=current, fasta=current, diffs=2) 
chimera.vsearch(fasta=current, count=current, dereplicate=t, processors=32)
remove.seqs(fasta=current, accnos=current)
summary.seqs(fasta=current, count=current)
classify.seqs(fasta=current, count=current, reference=silva.v4.fasta, cutoff=80)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Bacteria;Cyanobacteria;Cyanobacteriia;Chloroplast-Bacteria_unclassified;-Bacteria;Cyanobacteria/Chloroplast;-Mitochondria;-Unknown;-Archaea;-Eukaryota;)
dist.seqs(fasta=current, cutoff=0.03)
cluster(column=current, count=current)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)
make.shared(list=current, count=current, label=0.03)

Any clue what’s going on here? Thanks in advance!

Hi there,

It’s hard to say - the pipeline looks right. Can you look at your pp.files file and make sure it lists each sample only once? Getting twice the number of uniques is suspicious to me. Alternatively, can you post the output of running summary.seqs after make.contigs and after unique.seqs?
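
If it helps, here is a rough Python sketch for checking pp.files for repeats. It assumes the usual make.file layout of one row per sample with whitespace-separated columns (sample name, forward fastq, reverse fastq); if your file has a different layout, the column indexes would need adjusting. The function name find_duplicates is just something made up for this example, not part of mothur.

```python
import os
import sys
from collections import Counter


def find_duplicates(lines):
    """Return (duplicate sample names, duplicate fastq paths) from .files rows.

    Assumes each row is: sample_name  forward.fastq  reverse.fastq
    """
    samples = Counter()
    fastqs = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank rows or rows without the assumed 3 columns
        samples[fields[0]] += 1
        for path in fields[1:3]:
            fastqs[path] += 1
    dup_samples = sorted(name for name, n in samples.items() if n > 1)
    dup_fastqs = sorted(path for path, n in fastqs.items() if n > 1)
    return dup_samples, dup_fastqs


if __name__ == "__main__":
    # Default to pp.files in the current directory; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "pp.files"
    if os.path.exists(path):
        with open(path) as handle:
            dup_samples, dup_fastqs = find_duplicates(handle)
        print("samples listed more than once:", dup_samples or "none")
        print("fastq files listed more than once:", dup_fastqs or "none")
```

If either list comes back non-empty, the same reads are probably being fed into make.contigs more than once, which would inflate both the total and the unique sequence counts.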

