No. of unique sequences decreased to 15% after chimera.uch

Hello,

  1. I used the greengenes reference taxonomy and alignment files. After running chimera.uchime and remove.seqs, I lost 85% of my unique sequences. Why?

  2. Another question: why did screen.seqs not work? The summary output is below.

mothur > align.seqs(fasta=stability.trim.contigs.trim.good.unique.fasta, reference=gg.refalign, flip=T)

mothur > summary.seqs(fasta=current)


Start End NBases Ambigs Polymer NumSeqs
Minimum: 5 2263 370 0 3 1
2.5%-tile: 9 2266 370 0 4 312
25%-tile: 9 2266 371 0 4 3119
Median: 9 2266 372 0 5 6237
75%-tile: 9 2266 373 0 5 9355
97.5%-tile: 9 2266 376 0 6 12161
Maximum: 13 2293 376 0 8 12472
Mean: 9.01379 2266.03 372.169 0 4.71208
# of Seqs: 12472

mothur > screen.seqs(fasta=stability.trim.contigs.trim.good.unique.align, count=stability.trim.contigs.trim.good.count_table, summary=stability.trim.contigs.trim.good.unique.summary, start=9, end=2266, maxhomop=8)

It took 1 secs to screen 12472 sequences.

mothur > summary.seqs(fasta=current, count=current)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 5 2266 370 0 3 1
2.5%-tile: 9 2266 370 0 4 3339
25%-tile: 9 2266 372 0 4 33387
Median: 9 2266 372 0 5 66774
75%-tile: 9 2266 373 0 5 100161
97.5%-tile: 9 2266 376 0 6 130209
Maximum: 9 2293 376 0 8 133547
Mean: 8.99997 2266 372.232 0 4.62525

# of unique seqs: 12400

total # of seqs: 133547


Looking forward to your suggestions.

  1. I used the greengenes reference taxonomy and alignment files. After running chimera.uchime and remove.seqs, I lost 85% of my unique sequences. Why?

Because a lot of your unique reads were chimeras. What percentage of your total reads were discarded?

  2. Another question: why did screen.seqs not work? The summary output is below.

Looks like it worked, what am I missing?
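
The difference between losing unique sequences and losing total reads, which the first answer asks about, can be illustrated with a small sketch. The count table and the flagged names below are made-up numbers chosen for illustration, not data from this thread, and `fraction_removed` is a hypothetical helper, not a mothur function.

```python
def fraction_removed(counts, flagged):
    """Return (fraction of unique sequences removed, fraction of total reads removed)."""
    removed_uniques = sum(1 for name in counts if name in flagged)
    removed_reads = sum(n for name, n in counts.items() if name in flagged)
    return removed_uniques / len(counts), removed_reads / sum(counts.values())


# Hypothetical count table: unique sequence name -> number of reads it represents.
counts = {"abundant1": 500, "abundant2": 300, "abundant3": 183}
counts.update({f"rare{i}": 1 for i in range(1, 18)})  # 17 singletons

# Assume, for illustration only, that every singleton was flagged as chimeric.
flagged = {name for name in counts if name.startswith("rare")}

uniq_frac, read_frac = fraction_removed(counts, flagged)
print(f"{uniq_frac:.1%} of uniques removed, {read_frac:.1%} of reads removed")
```

In this toy case most of the unique sequences disappear while almost all reads remain, because the flagged uniques are rare.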

Hi Dr. Schloss,

  1. I realised from your question (what percentage of my total reads was discarded) that I did not lose most of the total sequences. I still have 98% of the total sequences. This means that only 2% of the total sequences account for the 85% of unique sequences that were rubbish.


  2. I wanted to exclude sequences below 5 (start) and above 2266 (end), but I can still see them in the summary.seqs output. However, I can see a reduction in the number of sequences.

Thank you

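The start option removes sequences that start after the start position, and the end option removes those that end before the end position. So with start=9 and end=2266, a sequence that starts at position 5 or ends at 2293 is kept, while the sequences that started at 13 or ended at 2263 were removed, which matches your second summary.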
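
The rule for the start and end options can be sketched in a few lines of Python. This is only an illustration of the rule as described in this reply, not mothur's actual implementation; `passes_screen` is a hypothetical helper, and the coordinate pairs are invented from the extreme values in the summaries above (the real pairings are not shown in the thread).

```python
def passes_screen(seq_start, seq_end, start=None, end=None):
    """Keep a sequence only if it starts at or before `start`
    and ends at or after `end` (alignment coordinates)."""
    if start is not None and seq_start > start:
        return False  # starts too late
    if end is not None and seq_end < end:
        return False  # ends too early
    return True


# Hypothetical (start, end) pairs built from the extremes in the summaries.
examples = [(5, 2266), (9, 2293), (13, 2266), (9, 2263)]

kept = [(s, e) for s, e in examples if passes_screen(s, e, start=9, end=2266)]
print(kept)  # [(5, 2266), (9, 2293)]
```

Under this rule a sequence that starts at 5 or ends at 2293 still covers the region from 9 to 2266, so it is kept; only sequences that fail to span that region are dropped.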