An explosion of unique seqs

aliemelo · June 19, 2015, 8:16pm

Hi, all

My laboratory group is running a soil 16S analysis. We used MiSeq kit v2 with an outuput of 7,5 GB and 500 cycles.
We ran the sequences in mothur, using the following pipeline:

make.contigs(file=All.files, align=needleman, processors=8)
summary.seqs(fasta=All.trim.contigs.fasta)
screen.seqs(fasta=All.trim.contigs.fasta, group=All.contigs.groups, maxambig=0, maxlength=275)
summary.seqs(fasta=All.trim.contigs.good.fasta)
unique.seqs(fasta=All.trim.contigs.good.fasta)
summary.seqs(fasta=All.trim.contigs.good.unique.fasta, name=All.trim.contigs.good.names)

The summary after unique.seqs gives the following information:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 153 153 0 3 1
2.5%-tile: 1 236 236 0 4 131235
25%-tile: 1 240 240 0 4 1312350
Median: 1 240 240 0 4 2624699
75%-tile: 1 240 240 0 5 3937048
97.5%-tile: 1 241 241 0 6 5118162
Maximum: 1 275 275 0 12 5249396
Mean: 1 239.626 239.626 0 4.38168

of unique seqs: 4058509

total # of seqs: 5249396

As you can see, ~77% of all sequences are ‘unique’. So we decided to check the quality of the sequences with seqyclean, using phred values of 20 and 30.
When running the same pipeline above with a subset of the sequences after quality trimming in seqyclean, we got:

For phred= 20:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 164 164 0 3 1
2.5%-tile: 1 236 236 0 4 6772
25%-tile: 1 239 239 0 4 67711
Median: 1 240 240 0 4 135421
75%-tile: 1 240 240 0 5 203131
97.5%-tile: 1 241 241 0 6 264070
Maximum: 1 274 274 0 10 270841
Mean: 1 239.527 239.527 0 4.35679

of unique seqs: 242091

total # of seqs: 270841

And for phred=30:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 152 152 0 3 1
2.5%-tile: 1 234 234 0 4 5033
25%-tile: 1 239 239 0 4 50322
Median: 1 240 240 0 4 100643
75%-tile: 1 240 240 0 5 150964
97.5%-tile: 1 241 241 0 6 196252
Maximum: 1 274 274 0 10 201284
Mean: 1 239.003 239.003 0 4.33494

of unique seqs: 183985

total # of seqs: 201284

It didn’t seem to help at all…
Is there a way to solve this? We’re just starting to work with this kind of sequence and really could use some help.

Thanks!

Kendra · June 20, 2015, 1:19pm

Welcome to soils Keep following the SOP especially including pre.clustering. Does that get you down to a manageable number? I had ~700k pre.clustered “uniques” that I managed to cluster with cluster.split, but just barely (because the alpha proteos are so abundant and diverse) by using a single processor on a high memory (128gb) machine

aliemelo · June 22, 2015, 6:33pm

Hi, thanks for replying!

The first time I ran the SOP, I got 3,859,378 unique sequences (from a total of 5,259,281). After pre.cluster, there were 3,001,495 unique seqs, and after removing the chimeras, there were 2,076,182 ‘uniques’.
I then tried to run cluster.split:

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=fasta, large=T, taxlevel=4, cutoff=0.03)

But two weeks later nothing came out, it got stuck

I thought that it could have something to do with the quality of the sequences, so I ran them in FastQC.
One of the samples seems pretty bad:

But the others look something like this:

I’m unsure if the parameters used in cluster.split are correct or whether it’s possible to run this with ~2M unique sequences…
Do you think that using phylotypes is the only way to go?

Kendra · June 23, 2015, 6:13pm

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=fasta, large=T, taxlevel=4, cutoff=0.03)

First your cutoff should be higher-like .10 or .15. I don’t think that large=T is actually helpful, but maybe Pat or Sarah will weigh in on that. If that still hangs (possible because of certain very abundant groups), I’d change taxlevel=5 before I’d go phylotype. Actually, I’d use a greedy algorithm (uclust etc) before I’d go with phylotype

fibar · July 27, 2015, 5:42pm

Hi everyone,
I ran into the same problem. According to Pat, I think it’s a sequencer problem; take a look at this reply:
http://mothur.ltcmp.net/t/align-seqs-ran-out-of-memory/1985/1

I’ll let you know if I hear anything else.
Cheers,
Fred

fibar · July 27, 2015, 8:19pm

And check the blog:
http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

pschloss · August 3, 2015, 2:41pm

Illumina has definitely been having some quality issues with their V2 chemistry over the past few months. If you call and badger them about it and have your read metrics handy they may be willing to provide you a replacement kit or to help trouble shoot what is going on.

pat

Topic		Replies	Views
How many normal unique sequences after unique.seq Commands in mothur	7	5069	March 17, 2014
Bug in align.seqs? Commands in mothur	6	3772	February 1, 2015
unique.seq - large number Commands in mothur	7	5748	October 20, 2014
unique.seqs command Commands in mothur	4	34403	February 11, 2013
Empty folder after the unique.seqs command Commands in mothur	2	662	October 13, 2019

An explosion of unique seqs

of unique seqs: 4058509

of unique seqs: 242091

of unique seqs: 183985

Related topics