High number of unique sequences problem

Hello,
I used pcr.seqs to remove my oligos. After this, my sequences vary (slightly) in length. I am considering only 367-376 here because in the next step I will exclude sequences shorter than 367 or longer than 376.
Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 216 216 0 3 1
2.5%-tile: 1 367 367 0 4 268439
25%-tile: 1 370 370 0 4 2684389
Median: 1 372 372 0 5 5368778
75%-tile: 1 373 373 1 5 8053166
97.5%-tile: 1 376 376 9 6 10469116
Maximum: 1 462 462 56 222 10737554
Mean: 1 371.811 371.811 1.14461 4.71137

# of Seqs: 10737554

Next, I screened by minimum and maximum length; the summary.seqs output is below.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 367 367 0 3 1
2.5%-tile: 1 368 368 0 4 259856
25%-tile: 1 370 370 0 4 2598558
Median: 1 372 372 0 5 5197115
75%-tile: 1 373 373 1 5 7795672
97.5%-tile: 1 376 376 9 6 10134374
Maximum: 1 376 376 32 113 10394229
Mean: 1 371.84 371.84 1.12945 4.71958

# of Seqs: 10394229

Output File Names:
stability.trim.contigs.trim.good.good.summary

I chose to keep sequences of length 367 to 376. After the unique.seqs step, I get 1166568 unique sequences:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 367 367 0 3 1
2.5%-tile: 1 368 368 0 4 178055
25%-tile: 1 370 370 0 4 1780543
Median: 1 372 372 0 5 3561085
75%-tile: 1 373 373 0 5 5341627
97.5%-tile: 1 376 376 0 6 6944115
Maximum: 1 376 376 0 70 7122169
Mean: 1 371.93 371.93 0 4.67692

# of unique seqs: 1166568

total # of seqs: 7122169
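
As a quick sanity check on the counts reported above, the arithmetic can be sketched in Python. This is only back-of-the-envelope arithmetic on the numbers in the summaries; the variable names are made up for illustration and nothing here calls mothur.

```python
# Counts copied from the summary.seqs outputs above.
seqs_before_screen = 10_737_554   # "# of Seqs" before length screening
seqs_after_screen = 10_394_229    # "# of Seqs" after keeping 367-376 bases
total_seqs = 7_122_169            # "total # of seqs" after unique.seqs
unique_seqs = 1_166_568           # "# of unique seqs" after unique.seqs

# Fraction of reads that survived the length screen.
kept_fraction = seqs_after_screen / seqs_before_screen

# Share of the remaining reads that are distinct sequences,
# and the average number of reads represented by each unique.
unique_fraction = unique_seqs / total_seqs
reads_per_unique = total_seqs / unique_seqs

print(f"kept by length screen: {kept_fraction:.1%}")
print(f"unique / total:        {unique_fraction:.1%}")
print(f"reads per unique:      {reads_per_unique:.2f}")
```

With these numbers, roughly one read in six is a distinct sequence.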

Q1) This number of unique sequences seems too high, and I recently ran into a problem at the cluster.split step (mothur stopped proceeding). What should I do? Is a minimum length of 367 and a maximum length of 376 OK? Please suggest.
Just to add to my question above: some of my libraries have read quality scores below 28. I want to see how they get processed with mothur, on the belief that not all the sequences in a library are bad; many of them must be of good quality. I understand that mothur does quality filtering very well. But could these low-quality libraries be what is making the number of unique sequences so high?

Q2) If the variable sequence length (a spread of about 10 bases) is a problem, how can I trim the sequences to equal length using mothur?

Q3) My third question: during another mothur run, at the cluster.split step, I got the following message.

mothur > cluster.split(fasta=szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, taxonomy=szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

Why did the cutoff change from 0.15 to 0.05?

Using 4 processors.
Using splitmethod fasta.
Splitting the file…
/******************************************/…………………………………………………….

Clustering szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.pick.pick.fasta.0.dist
Cutoff was 0.155 changed cutoff to 0.05
Cutoff was 0.155 changed cutoff to 0.05
It took 6306 seconds to cluster
Merging the clustered files…
It took 7 seconds to merge.



Looking forward to your reply.

Richa

I’ve answered this elsewhere for you, but just in case anyone else comes along looking for an answer…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/
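
For a rough sense of scale: the number of pairwise comparisons grows with the square of the number of unique sequences. The sketch below is plain arithmetic, not mothur's method. It gives an upper bound on all-against-all pairs for the 1166568 uniques reported in the question, and it ignores the cutoff and the splitting that cluster.split uses to reduce the work. The helper name pairwise_count is hypothetical.

```python
def pairwise_count(n: int) -> int:
    """Number of distinct unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


unique_seqs = 1_166_568  # "# of unique seqs" from the question above

# Upper bound on the number of pairwise distances (all-against-all).
pairs = pairwise_count(unique_seqs)
print(f"{unique_seqs:,} unique sequences -> up to {pairs:.2e} pairwise distances")

# Halving the number of uniques cuts the pair count to roughly a quarter.
halved_pairs = pairwise_count(unique_seqs // 2)
print(f"ratio after halving: {halved_pairs / pairs:.3f}")
```

Because the growth is quadratic, any step that lowers the number of unique sequences before clustering reduces the pair count disproportionately.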