I am processing 16S rRNA gene amplicon data (targeting the V4–V5 region with primers 515F GTGCCAGCMGCCGCGGTAA and 907R CCGTCAATTCCTTTGAGTTT). My pipeline so far:
Removed adapters and primers.
Filtered reads with average quality score <20.
Merged paired-end reads (make.contigs).
Plotted the distribution of merged fragment lengths (summary.seqs).
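As a sanity check, the merged-length distribution can also be tallied outside mothur. A minimal Python sketch (the filename `contigs.fasta` is a stand-in for the make.contigs output):

```python
def read_lengths(fasta_path):
    """Return a list of sequence lengths from a FASTA file."""
    lengths, seq = [], []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    lengths.append(len("".join(seq)))
                seq = []
            else:
                seq.append(line)
    if seq:  # last record
        lengths.append(len("".join(seq)))
    return lengths

# lengths = read_lengths("contigs.fasta")  # hypothetical path
# then histogram/plot `lengths` however you like
```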
Now I am uncertain about the best length filtering strategy.
One option I tried is keeping only the reads that fall within the 2.5th–97.5th percentile range of the length distribution and removing the rest as outliers.
However, I am not sure if this is appropriate, because it might discard valid biological sequences (some taxa may naturally produce slightly longer/shorter amplicons).
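For concreteness, the percentile window I tried can be sketched like this (nearest-rank percentiles; the length list would come from summary.seqs or a tally of the contigs):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: p is in [0, 100]."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def percentile_window(lengths, lo_pct=2.5, hi_pct=97.5):
    """Return (min, max) length cutoffs covering the central 95%."""
    return percentile(lengths, lo_pct), percentile(lengths, hi_pct)

# lengths = [...]            # merged fragment lengths
# lo, hi = percentile_window(lengths)
# kept = [n for n in lengths if lo <= n <= hi]
```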
So my questions are:
Should I stick with percentile-based cutoffs (e.g., 2.5%–97.5%), or is it better to use fixed min/max cutoffs based on the expected amplicon length (e.g., 320–420 bp, but I’m not sure)?
Does mothur (or best practice in the field) recommend using a biologically defined window for length filtering instead of a purely statistical rule?
This is my data:
What does the distribution look like if you use something like the SILVA reference alignment for the region? I’d use that.
Also, these two steps:
Removed adapters and primers.
Filtered reads with average quality score <20.
likely aren’t going to be helpful. make.contigs will remove the primers and barcodes for you, and filtering reads based on quality scores before merging may leave you with a read in one file but not the other. That you still have a long homopolymer sequence and a sequence with a bunch of Ns suggests the filtering is not doing a great job of removing bad reads.
To me, it looks like you should be using 377 to 377 (i.e., keep only the 377 nt contigs). You might pull out the 468 nt sequence and BLAST it to see what it is. I’m not sure how you got an 81 nt sequence; are you trimming the reads based on quality score as well? In our experience, that type of trimming results in worse contigs, not better.
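In mothur, that fixed window would be a screen.seqs call with minlength=377 and maxlength=377. The same screen, done by hand in Python (record names and the 377 nt window are taken from the discussion above):

```python
def screen_by_length(records, min_len=377, max_len=377):
    """Keep only (name, seq) records whose length falls in the window.

    min_len/max_len mirror mothur's screen.seqs minlength/maxlength
    parameters; 377-377 keeps just the dominant amplicon length.
    """
    return [(name, seq) for name, seq in records
            if min_len <= len(seq) <= max_len]

# recs = [("good", "A" * 377), ("short", "A" * 81), ("long", "A" * 468)]
# screen_by_length(recs)  # only the 377 nt record survives
```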