I am processing 16S rRNA gene amplicon data (targeting the V4–V5 region with primers 515F GTGCCAGCMGCCGCGGTAA and 907R CCGTCAATTCCTTTGAGTTT). My pipeline so far:
Removed adapters and primers.
Filtered reads with average quality score <20.
Merged paired-end reads (make.contigs).
Plotted the distribution of merged fragment lengths (summary.seqs).
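As a sanity check, the merged-length distribution can also be tallied outside mothur. A minimal Python sketch (the filename `contigs.fasta` is a stand-in for the make.contigs output):

```python
def read_lengths(fasta_path):
    """Return a list of sequence lengths from a FASTA file."""
    lengths, seq = [], []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    lengths.append(len("".join(seq)))
                seq = []
            else:
                seq.append(line)
    if seq:  # last record
        lengths.append(len("".join(seq)))
    return lengths

# lengths = read_lengths("contigs.fasta")  # hypothetical path
# then histogram/plot `lengths` however you like
```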
Now I am uncertain about the best length filtering strategy.
One option I tried is keeping only the reads that fall within the 2.5th–97.5th percentile range of the length distribution and removing the rest as outliers.
However, I am not sure if this is appropriate, because it might discard valid biological sequences (some taxa may naturally produce slightly longer/shorter amplicons).
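For concreteness, the percentile window I tried can be sketched like this (nearest-rank percentiles; the length list would come from summary.seqs or a tally of the contigs):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: p is in [0, 100]."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def percentile_window(lengths, lo_pct=2.5, hi_pct=97.5):
    """Return (min, max) length cutoffs covering the central 95%."""
    return percentile(lengths, lo_pct), percentile(lengths, hi_pct)

# lengths = [...]            # merged fragment lengths
# lo, hi = percentile_window(lengths)
# kept = [n for n in lengths if lo <= n <= hi]
```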
So my questions are:
Should I stick with percentile-based cutoffs (e.g., 2.5%–97.5%), or is it better to use fixed min/max cutoffs based on the expected amplicon length (e.g., 320–420 bp, but I’m not sure)?
Does mothur (or best practice in the field) recommend using a biologically defined window for length filtering instead of a purely statistical rule?
This is my data:
What does the distribution look like if you use something like the SILVA reference alignment for the region? I’d use that.
Also, these two steps:
Removed adapters and primers.
Filtered reads with average quality score <20.
likely aren’t going to be helpful. make.contigs will remove the primers and barcodes for you, and filtering reads based on quality scores before merging may leave you with a read in one file but not the other. That you still have a long homopolymer sequence and a sequence with a bunch of Ns suggests the filtering is not doing a great job of removing bad reads.
To me, it looks like you should be using 377 to 377 (i.e., keep only the 377 nt contigs). You might pull out the 468 nt sequence and BLAST it to see what it is. I’m not sure how you got an 81 nt sequence; are you trimming the reads based on quality score as well? In our experience, that type of trimming results in worse contigs, not better.
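In mothur, that fixed window would be a screen.seqs call with minlength=377 and maxlength=377. The same screen, done by hand in Python (record names and the 377 nt window are taken from the discussion above):

```python
def screen_by_length(records, min_len=377, max_len=377):
    """Keep only (name, seq) records whose length falls in the window.

    min_len/max_len mirror mothur's screen.seqs minlength/maxlength
    parameters; 377-377 keeps just the dominant amplicon length.
    """
    return [(name, seq) for name, seq in records
            if min_len <= len(seq) <= max_len]

# recs = [("good", "A" * 377), ("short", "A" * 81), ("long", "A" * 468)]
# screen_by_length(recs)  # only the 377 nt record survives
```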