unique.seqs - large number

Hello. I'm new to sequencing and trying to work my way up a rather steep learning curve. I have a few questions related to unique.seqs:

We just completed 16S amplicon sequencing on a PGM (316 chip). I wanted to generate phylogenetic trees for my samples and compare them (I was planning to do this in MATLAB).
I am running mothur within the Galaxy platform and ran the following:

  • fastq.info
  • trim.seqs (minlength=300, maxlength=380)
  • summary.seqs (result below):
                Start    End      NBases   Ambigs  Polymer  NumSeqs
    Minimum:    1044     1046     2        0       1        1
    2.5%-tile:  34102    41566    322      0       4        5214
    25%-tile:   34102    42546    349      0       4        52135
    Median:     34102    42546    350      0       4        104270
    75%-tile:   34102    42546    351      0       4        156405
    97.5%-tile: 34102    42546    354      0       5        203326
    Maximum:    43113    43116    377      0       17       208539
    Mean:       34101.6  42484.3  347.934  0       4.23547

    # of Seqs: 208539

  • unique.seqs
  • summary.seqs (result below):
                Start    End      NBases   Ambigs  Polymer  NumSeqs
    Minimum:    1044     1046     2        0       1        1
    2.5%-tile:  34102    41536    312      0       4        2337
    25%-tile:   34102    42546    346      0       4        23370
    Median:     34102    42546    350      0       4        46739
    75%-tile:   34102    42546    351      0       5        70108
    97.5%-tile: 34102    42546    355      0       5        91140
    Maximum:    43113    43116    377      0       17       93476
    Mean:       34101.1  42423.5  345.901  0       4.307

    # of Seqs: 93476
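For reference, the run above corresponds roughly to the mothur command sequence below. The file names are placeholders, not the actual Galaxy outputs:

```
fastq.info(fastq=samples.fastq)
trim.seqs(fasta=samples.fasta, qfile=samples.qual, minlength=300, maxlength=380)
summary.seqs(fasta=samples.trim.fasta)
unique.seqs(fasta=samples.trim.fasta)
summary.seqs(fasta=samples.trim.unique.fasta)
```

unique.seqs also writes a .names file mapping each unique sequence back to its duplicates; running summary.seqs on the unique fasta without that file is why the sequence count drops from 208539 to 93476 in the second table.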


  • I had not expected the high number of unique sequences - am I missing a step and/or doing something incorrectly?

Again - I’m new to this and if you can point me to something to read/review or have any suggestions - it would be much appreciated.


Start here…


I would add that IonTorrent data has the problems described in the blog post times 1000. I would try again and get MiSeq data. Sorry to sound so negative, but I have yet to see IonTorrent data on this forum or in my own lab that is anything I would want to work with.


Thanks for the reply. That sounds more realistic than negative. Unfortunately, I don't have the luxury of going out and getting more sequencing data - so, back to the drawing board it is…

If I could hijack this thread, I have a similar problem but we have used MiSeq.

It is V4, so the same region as used in the SOP. I have followed the SOP and, just prior to dist.seqs, have 11 million sequences from 70 human stool samples and 230,000 unique sequences. dist.seqs did run, but produced an enormous .dist file which cluster clearly won't cope with.

The issue is obviously the high number of unique sequences, suggesting that a high degree of error remains in the data. Is there anything I can do about this? Is there a more stringent quality filter I can apply at the start? Should I change the options in make.contigs?

Thanks in advance,


Though I don't disagree with Pat regarding the, let's say, underwhelming quality of data generated by Ion Torrent, I take less of a fatalist approach to it and I think the data is workable. Here's what worked for me.
First, out of curiosity: I noticed you don't have a lot of seqs in general. When I use a 316 chip I expect a minimum of 1.5 million reads, and normally I get 4 million - did you get rid of some samples?
Second, if you're not following the 454 SOP, do it (http://www.mothur.org/wiki/454_SOP). With that in mind: PGM generates an sff file, but don't bother with the sff/flow-file route - shhh.flows will not work well on PGM data in my experience. I don't know if you incorporated quality scores when trimming your seqs, but you want to incorporate qwindowaverage. The 454 SOP recommends 35; I found that with PGM data that gets rid of more than 99% of my data (did I mention underwhelming?), so set it to 25. Since the quality drops drastically as you get closer to the end of a read, I would also incorporate keepfirst=200 or 250 and minlength=200 or 250, respectively, in your trim.seqs. I found this approach to work relatively well: it will first take all of your seqs and keep only the first 200 or 250 bases, and then get rid of all reads shorter than 200 or 250 bases (keepfirst by itself won't discard a read that is, say, only 150 bases long - it's the minlength filter that then removes it). That way you also make sure all your seqs are the same length, which will make your life easier down the line.
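Putting that advice together, the trim.seqs call might look like the sketch below. File names are placeholders, and qwindowsize=50 is an assumption carried over from the 454 SOP rather than something stated above:

```
trim.seqs(fasta=samples.fasta, qfile=samples.qual,
    qwindowaverage=25, qwindowsize=50,
    keepfirst=250, minlength=250)
```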
Alignment: use pcr.seqs to shorten the SILVA dataset to your V region before you do the alignment. If you don't know the positions, you can align the parts of the E. coli 16S that correspond to your V region to SILVA and then use the positions you get for pcr.seqs. Again, it's not clear whether you've done the alignment and pre.cluster steps, but these steps will significantly reduce the number of uniques you are seeing.
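As a sketch of that step: for V4, the mothur MiSeq SOP trims the SILVA reference with the coordinates below; for a different V region you would substitute the start/end positions found via the E. coli alignment trick described above (file names are placeholders):

```
pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F)
align.seqs(fasta=samples.trim.unique.fasta, reference=silva.bacteria.pcr.fasta)
```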

Hope this helps

Here's what I'm thinking: the MiSeq SOP data has 2609 unique seqs out of a total of 119463 prior to dist.seqs, which makes it ~2% uniques; your set has 230K uniques out of 11 million total reads, which also makes it ~2% uniques. Why would you automatically assume that the diversity you are seeing comes from noisy data? Even if you had a larger proportion of uniques, I wouldn't automatically think the problem is with the data without first thinking of the environment it came from - you have to put everything in ecological context. I haven't tried it, but you can look at the full set of samples Pat has linked in the MiSeq SOP and see how that affects the distance matrix generated by dist.seqs. If you are really concerned with the number of uniques, you can up the diffs in pre.cluster to 2% (which is the recommendation in the Huse et al. 2010 Environ Microbiol paper). Just make sure you understand the implications of doing it.
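One note on units if you do go that route: diffs in pre.cluster is an absolute number of mismatches, not a percentage, so for ~250 bp V4 reads 2% works out to about 5 diffs (the MiSeq SOP uses diffs=2, i.e. roughly 1 difference per 100 bp). File names below are placeholders:

```
pre.cluster(fasta=samples.good.unique.fasta, name=samples.good.names, diffs=5)
```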

Thank you for your comment. The problem with your calculation is in predicting that the number of uniques will be a fixed proportion of the number of reads. The actual number of unique sequences in the community is a fixed number, and sufficient coverage should have been obtained with substantially fewer reads than we obtained. Put another way, if you have 20 organisms in a mock community, you should see 20 OTUs whether you include 5000 or 50000 reads. Changing the sensitivity of pre.cluster was going to be my next step, but if error is responsible for the problem, and it is random, I don't expect this to help much. I may try Trimmomatic to discard the worst reads and shorten others.

If all else fails, I shall have to resort to a phylotype-based analysis (Sorry, Pat!)

Another thing I’ve seen is that people are using the v3 chemistry on the V4 region without dialing back the number of cycles. This causes huge problems because you essentially sequence off the end of the read and jack up the error rate.