An explosion of unique seqs

Hi all,

My laboratory group is running a soil 16S analysis. We used a MiSeq v2 kit (500 cycles) with an output of 7.5 GB.
We ran the sequences in mothur, using the following pipeline:

make.contigs(file=All.files, align=needleman, processors=8)
summary.seqs(fasta=All.trim.contigs.fasta)
screen.seqs(fasta=All.trim.contigs.fasta, group=All.contigs.groups, maxambig=0, maxlength=275)
summary.seqs(fasta=All.trim.contigs.good.fasta)
unique.seqs(fasta=All.trim.contigs.good.fasta)
summary.seqs(fasta=All.trim.contigs.good.unique.fasta, name=All.trim.contigs.good.names)

The summary after unique.seqs gives the following information:

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      153      153      0       3        1
2.5%-tile:  1      236      236      0       4        131235
25%-tile:   1      240      240      0       4        1312350
Median:     1      240      240      0       4        2624699
75%-tile:   1      240      240      0       5        3937048
97.5%-tile: 1      241      241      0       6        5118162
Maximum:    1      275      275      0       12       5249396
Mean:       1      239.626  239.626  0       4.38168

# of unique seqs: 4058509

total # of seqs: 5249396

As you can see, ~77% of all sequences are ‘unique’. So we decided to quality-trim the sequences with seqyclean, using Phred cutoffs of 20 and 30.
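The seqyclean calls looked roughly like this (a sketch rather than our exact command line; the file names are placeholders, and -qual takes two Phred-scale thresholds, so double-check the flags against your seqyclean version):

# file names are placeholders for our raw paired-end reads
seqyclean -1 All_R1.fastq -2 All_R2.fastq -o All_q20 -qual 20 20
seqyclean -1 All_R1.fastq -2 All_R2.fastq -o All_q30 -qual 30 30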
When we ran the same pipeline on a subset of the sequences after quality trimming in seqyclean, we got:

For phred = 20:

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      164      164      0       3        1
2.5%-tile:  1      236      236      0       4        6772
25%-tile:   1      239      239      0       4        67711
Median:     1      240      240      0       4        135421
75%-tile:   1      240      240      0       5        203131
97.5%-tile: 1      241      241      0       6        264070
Maximum:    1      274      274      0       10       270841
Mean:       1      239.527  239.527  0       4.35679

# of unique seqs: 242091

total # of seqs: 270841

And for phred = 30:

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      152      152      0       3        1
2.5%-tile:  1      234      234      0       4        5033
25%-tile:   1      239      239      0       4        50322
Median:     1      240      240      0       4        100643
75%-tile:   1      240      240      0       5        150964
97.5%-tile: 1      241      241      0       6        196252
Maximum:    1      274      274      0       10       201284
Mean:       1      239.003  239.003  0       4.33494

# of unique seqs: 183985

total # of seqs: 201284

It didn’t seem to help at all…
Is there a way to solve this? We’re just starting to work with this kind of sequencing data and could really use some help.

Thanks!

Welcome to soils :) Keep following the SOP, especially the pre.cluster step. Does that get you down to a manageable number? I had ~700k pre.clustered “uniques” that I managed to cluster with cluster.split, but just barely (because the alphaproteobacteria are so abundant and diverse), using a single processor on a high-memory (128 GB) machine.
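In outline, the remaining SOP steps after your unique.seqs look something like this (a sketch using mothur's default output names from your commands above; silva.v4.fasta and the start/end coordinates are the SOP's V4 examples, so substitute your own region):

# file names follow mothur's defaults; the reference and coordinates are
# the SOP's V4 examples -- adjust to your region
count.seqs(name=All.trim.contigs.good.names, group=All.contigs.good.groups)
align.seqs(fasta=All.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
screen.seqs(fasta=All.trim.contigs.good.unique.align, count=All.trim.contigs.good.count_table, start=1968, end=11550, maxhomop=8)
filter.seqs(fasta=All.trim.contigs.good.unique.good.align, vertical=T, trump=.)
unique.seqs(fasta=All.trim.contigs.good.unique.good.filter.fasta, count=All.trim.contigs.good.good.count_table)
pre.cluster(fasta=All.trim.contigs.good.unique.good.filter.unique.fasta, count=All.trim.contigs.good.unique.good.filter.count_table, diffs=2)

pre.cluster with diffs=2 merges uniques within ~1 difference per 100 bp of a more abundant sequence, which is usually what collapses these inflated unique counts.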

Hi, thanks for replying!

The first time I ran the SOP, I got 3,859,378 unique sequences (from a total of 5,259,281). After pre.cluster, there were 3,001,495 unique seqs, and after removing the chimeras, there were 2,076,182 ‘uniques’.
I then tried to run cluster.split:

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=fasta, large=T, taxlevel=4, cutoff=0.03)

But two weeks later nothing had come out; it was stuck :(

I thought that it could have something to do with the quality of the sequences, so I ran them in FastQC.
One of the samples seems pretty bad:

[FastQC per-base quality plot not shown]

But the others look something like this:

[FastQC per-base quality plot not shown]
I’m unsure whether the parameters used in cluster.split are correct, or whether it’s even possible to run this with ~2M unique sequences…
Do you think that using phylotypes is the only way to go?
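If we did go that route, I assume it would look something like this (a sketch using the same final.* files; label=1 bins at the deepest classified level):

# sketch of the phylotype alternative; label=1 uses the deepest (genus-level) bins
phylotype(taxonomy=final.taxonomy)
make.shared(list=final.tx.list, count=final.count_table, label=1)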

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=fasta, large=T, taxlevel=4, cutoff=0.03)

First, your cutoff should be higher, like 0.10 or 0.15. I don’t think that large=T is actually helpful, but maybe Pat or Sarah will weigh in on that. If it still hangs (possible because of certain very abundant groups), I’d change to taxlevel=5 before I’d go phylotype. Actually, I’d use a greedy algorithm (uclust etc.) before I’d go with phylotype.
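Concretely, something like this (a sketch with the same final.* inputs; only the cutoff is raised and large=T dropped, with taxlevel=5 held in reserve):

# same inputs as before; cutoff raised to 0.15 and large=T dropped
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=fasta, taxlevel=4, cutoff=0.15)
# if it still hangs, try taxlevel=5 so the split happens on smaller, family-level groups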

Hi everyone,
I ran into the same problem. According to Pat, it may be a sequencer problem; take a look at this reply:
http://mothur.ltcmp.net/t/align-seqs-ran-out-of-memory/1985/1

I’ll let you know if I hear anything else.
Cheers,
Fred

And check the blog:
http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Illumina has definitely been having some quality issues with their V2 chemistry over the past few months. If you call and badger them about it, and have your read metrics handy, they may be willing to provide you a replacement kit or to help troubleshoot what is going on.

pat