Appropriate length filtering strategy for 16S V4–V5 (515F–907R) amplicons: percentile vs fixed cutoffs?

Hi all,

I am processing 16S rRNA gene amplicon data (targeting the V4–V5 region with primers 515F GTGCCAGCMGCCGCGGTAA and 907R CCGTCAATTCCTTTGAGTTT). My pipeline so far:

Removed adapters and primers.

Filtered out reads with an average quality score below 20.

Merged paired-end reads (make.contigs).

Plotted the distribution of merged fragment lengths (summary.seqs); the mothur side of this is sketched below.
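For concreteness, steps 3 and 4 were along these lines (adapter/primer removal and quality filtering were done outside mothur; stability.files is a placeholder for my own paired-file list):

make.contigs(file=stability.files, processors=10)
summary.seqs(fasta=stability.trim.contigs.fasta)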

Now I am uncertain about the best length filtering strategy.

One option I tried is keeping only the reads whose lengths fall within the 2.5th–97.5th percentile range of the length distribution (removing the rest as outliers).
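In mothur terms, the closest I found to this rule is reading the 2.5%/97.5% tiles off summary.seqs and passing them to screen.seqs, or letting screen.seqs pick cutoffs itself via its optimize/criteria options (filenames and numbers here are placeholders):

summary.seqs(fasta=stability.trim.contigs.fasta)
screen.seqs(fasta=stability.trim.contigs.fasta, minlength=366, maxlength=377)
# or have mothur choose cutoffs that keep roughly 95% of the reads:
screen.seqs(fasta=stability.trim.contigs.fasta, optimize=minlength-maxlength, criteria=95)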

However, I am not sure if this is appropriate, because it might discard valid biological sequences (some taxa may naturally produce slightly longer/shorter amplicons).

So my questions are:

Should I stick with percentile-based cutoffs (e.g., 2.5%–97.5%), or is it better to use fixed min/max cutoffs based on the expected amplicon length (e.g., 320–420 bp, but I’m not sure)?

Does mothur (or best practice in the field) recommend using a biologically defined window for length filtering instead of a purely statistical rule?

This is my data (summary.seqs output):

Any guidance or references would be very helpful. Thanks in advance!

What does the distribution look like if you use something like the SILVA reference alignment for the region? I’d use that.
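Something along these lines, assuming the mothur-formatted SILVA reference (silva.bacteria.fasta and primers.oligos here are placeholders for your own files):

pcr.seqs(fasta=silva.bacteria.fasta, oligos=primers.oligos, keepdots=F)
align.seqs(fasta=stability.trim.contigs.fasta, reference=silva.bacteria.pcr.fasta)
summary.seqs(fasta=stability.trim.contigs.align)

pcr.seqs trims the reference down to the 515F–907R region so the alignment coordinates are easier to read.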

Also…

Removed adapters and primers.

Filtered out reads with an average quality score below 20.

likely aren’t going to be helpful. make.contigs will remove the primers and barcodes, and filtering reads based on quality scores may lead to a situation where you have a read in one file but not the other. The fact that you still have a long homopolymer sequence and a sequence with a bunch of Ns suggests the filtering is not doing a great job of removing bad reads.

To me, it looks like you should be using 377 to 377. You might pull out the 468 nt long sequence and blast it to see what it is. I’m not sure how you got an 81 nt long sequence. Are you trimming the reads based on quality score too? In our experience, that type of trimming results in worse contigs, not better.
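One quick way to pull out the unusually long contigs for a BLAST check is something like this (assuming make.contigs's default output name; 450 is just a guess at a threshold above the main peak; screen.seqs writes the sequences that pass to a .good.fasta):

screen.seqs(fasta=stability.trim.contigs.fasta, minlength=450)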

Pat

Thank you so much for your kind help.

Following your suggestion, I ran make.contigs on my raw data, using an oligos file to remove the primers:

make.contigs(file=stability.files, oligos=oligos.oligos, pdiffs=2, checkorient=TRUE, maxambig=0, maxhomop=8, processors=10)
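(For anyone reading later: pdiffs=2 allows up to two mismatches when matching each primer, checkorient=TRUE tries the reverse complement of a read when the primers are not found, maxambig=0 drops contigs containing any ambiguous bases, and maxhomop=8 drops contigs with homopolymer runs longer than 8 nt.)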

Then the results below came out better than what I had gotten by pre-trimming with cutadapt:

            Start   End     NBases  Ambigs  Polymer  NumSeqs
Minimum:    1       229     229     0       3        1
2.5%-tile:  1       366     366     0       4        166873
25%-tile:   1       372     372     0       4        1668730
Median:     1       374     374     0       5        3337459
75%-tile:   1       374     374     0       5        5006188
97.5%-tile: 1       377     377     0       6        6508044
Maximum:    1       448     448     0       8        6674916
Mean:       1       372     372     0       4
# of Seqs:  6674916

Later, I aligned stability.trim.contigs.fasta against the SILVA database and ran summary.seqs again; the commands were roughly as follows (the reference filename is a placeholder for the SILVA release I used):
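align.seqs(fasta=stability.trim.contigs.fasta, reference=silva.nr.align, processors=10)
summary.seqs(fasta=stability.trim.contigs.align)

The summary of the aligned contigs was: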

            Start   End     NBases  Ambigs  Polymer  NumSeqs
Minimum:    0       0       0       0       1        1
2.5%-tile:  13862   27654   366     0       4        166873
25%-tile:   13862   27654   372     0       4        1668730
Median:     13862   27654   374     0       5        3337459
75%-tile:   13862   27654   374     0       5        5006188
97.5%-tile: 13862   27654   377     0       6        6508044
Maximum:    43115   43116   448     0       8        6674916
Mean:       13856   27638   372     0       4
# of Seqs:  6674916

So I kept only the contigs of at least 366 nt.
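For the record, the screen amounted to something like this (exact filenames differ; the commented-out variant would additionally require contigs to cover the full start–end window from the alignment summary above):

screen.seqs(fasta=stability.trim.contigs.align, minlength=366)
# screen.seqs(fasta=stability.trim.contigs.align, start=13862, end=27654, minlength=366)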

Your suggestions solved my problem.

Best regards,

Echo Luo

That’s fantastic! Thanks for the update.
