Minimum Quality Requirements for analysing Amplicon Sequencing data (16S / 18S) with the mothur pipeline

Hi,

I am planning to run a 16S V4 analysis on amplicon sequencing data using mothur. When I quality filter the reads with a simple set of criteria, I am left with a very low number of good-quality reads per sample.

Conditions used to Quality filter Reads

Minimum length of read: 200
Minimum Q-value of each base position: 15
Minimum Mean Q-value of each read: 20
Maximum Ns allowed per read: 4
QC software used: prinseq/0.20.4
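
The criteria above can be sketched as a single per-read check. This is a minimal Python illustration, not prinseq's implementation: the function name `passes_qc` and the reading of "minimum Q-value of each base position" as "every base must reach Q15" are assumptions made for the example.

```python
def passes_qc(seq, quals, min_len=200, min_base_q=15, min_mean_q=20, max_ns=4):
    """Return True if a read meets all four thresholds listed above.

    seq   -- the read's bases as a string
    quals -- per-base Phred scores as a list of ints, same length as seq
    """
    if len(seq) < min_len:
        return False
    # Assumed interpretation: a single base below min_base_q rejects the read.
    if not quals or min(quals) < min_base_q:
        return False
    if sum(quals) / len(quals) < min_mean_q:
        return False
    if seq.upper().count("N") > max_ns:
        return False
    return True


# Hypothetical reads, for illustration only.
good = [30] * 250
one_bad_base = [30] * 249 + [10]
print(passes_qc("A" * 250, good))
print(passes_qc("A" * 250, one_bad_base))
```

Under this reading, one low-scoring base is enough to discard an otherwise good read, so the per-base threshold is by far the strictest of the four.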

Sample          Num_R1_before  Num_R2_before  Num_R1_after  Num_R2_after
test_sample_01  600381         600381         16            16
test_sample_02  493191         493191         10            10
test_sample_03  435412         435412         8             8
test_sample_04  460862         460862         20            20
test_sample_05  567018         567018         4             4
test_sample_06  407389         407389         3             3
test_sample_07  549802         549802         6             6
test_sample_08  403641         403641         2             2
test_sample_09  292051         292051         3             3
test_sample_10  444006         444006         10            10
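
To put the table in proportion, the fraction of read pairs that survive filtering can be computed directly from the counts above. A short Python sketch; the variable names are illustrative only.

```python
# (read pairs before filtering, read pairs after filtering), copied from the table.
counts = {
    "test_sample_01": (600381, 16),
    "test_sample_02": (493191, 10),
    "test_sample_03": (435412, 8),
    "test_sample_04": (460862, 20),
    "test_sample_05": (567018, 4),
    "test_sample_06": (407389, 3),
    "test_sample_07": (549802, 6),
    "test_sample_08": (403641, 2),
    "test_sample_09": (292051, 3),
    "test_sample_10": (444006, 10),
}

for sample, (before, after) in counts.items():
    print(f"{sample}: {after / before:.5%} retained")

total_before = sum(before for before, _ in counts.values())
total_after = sum(after for _, after in counts.values())
print(f"overall: {total_after} of {total_before} read pairs retained")
```

In every sample fewer than 1 in 10,000 read pairs pass, which points at the filter settings or the run as a whole rather than at individual samples.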

My Questions:

  1. Is it a common experience to see such low quality sequences from Amplicon sequencing libraries on the Illumina MiSeq platform?

  2. Based on the above observations (QC outcomes), can I reject these samples as being of very low quality and not suitable for metagenomic analysis using the mothur pipeline?

  3. Are there any specific suggestions I should follow when analysing such amplicon sequencing data within the mothur framework?

Thank you.

Have you run make.contigs without doing the quality filtering? We go directly from the sequencer to make.contigs. Have you sequenced a mock community?

Pat

Please excuse my late reply. Yes, I initially ran make.contigs without any external quality filtering. As I am currently working on test data, I don’t have a mock community.

Basically, I proceeded the standard way:
make.contigs -> summary.seqs -> screen.seqs -> unique.seqs -> count.seqs -> align.seqs -> summary.seqs

When I saw the last output (shown below), I thought that maybe the read quality was to blame for the poor results. That was when I started looking at the quality of my data.

mothur > summary.seqs(fasta=fulltest.trim.contigs.good.unique.align, count=fulltest.trim.contigs.good.count_table, processors=8)

Using 8 processors.

             Start     End       NBases    Ambigs  Polymer  NumSeqs
Minimum:     0         0         0         0       1        1
2.5%-tile:   2         13421     2         0       2        331814
25%-tile:    2         13421     2         0       2        3318132
Median:      13421     13423     2         0       2        6636263
75%-tile:    13421     13423     290       0       4        9954394
97.5%-tile:  13421     13423     290       0       6        12940712
Maximum:     13423     13423     316       0       8        13272525
Mean:        9446.18   13412.4   87.4228   0       2.81171

# of unique seqs: 8336372

total # of seqs: 13272525

In particular, I am puzzled that aligned sequences can start at 13421 and end at 13423, yet have a length of 290. I would like to know how this could happen.
I would also like to know whether one can still pursue 16S analysis with mothur in the absence of a mock community.
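
One possible reading of the table, offered as an assumption rather than a confirmed answer: if summary.seqs computes the percentiles of each column independently, then a row of the summary need not describe any single sequence. The Python sketch below uses made-up reads and a hypothetical helper, `column_quantile`, to show how a row such as "13421 13423 290" can appear even though no read has those three values together.

```python
# Hypothetical aligned reads as (start, end, nbases); values are illustrative only.
reads = [
    (2, 13421, 290),    # spans the expected region
    (2, 13421, 290),
    (13421, 13423, 2),  # only a couple of bases, placed at the end of the alignment
    (13421, 13423, 2),
    (13421, 13423, 2),
]


def column_quantile(rows, col, q):
    """Quantile of one column, sorted on its own and ignoring the other columns."""
    values = sorted(row[col] for row in rows)
    index = min(int(q * len(values)), len(values) - 1)
    return values[index]


# Build the "75%-tile" row the way a per-column summary would.
row75 = tuple(column_quantile(reads, col, 0.75) for col in range(3))
print(row75)
print(row75 in reads)
```

If this reading applies here, the table would suggest two groups of sequences: some that start near position 2 and carry about 290 bases, and many that align only at the very end of the alignment with about 2 bases.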

Thank you.

You might also want to check out the rename option in make.contigs which converts the names to numbers.

Pat