Hi there!!!
Including all the screening and chimera removal steps, what is the maximum percentage of reads that it is acceptable to remove?
Thanks,
Deep
When sequencing the V4 region with 2x250 reads, removing 10-30% of all sequences with screen.seqs would be reasonable, and losing 5-25% of all sequences as chimeras would be typical. If your reads don’t fully overlap, I would expect you to lose more sequences. Also, if you are looking at the fraction of unique sequences rather than total sequences, I’d expect the percentages to be higher.
Pat
I have a couple of queries/confusions, which I have numbered below.
Currently, I am using unpaired Illumina sequence data (V4 region). First I ran fastq.info to get the fasta and quality files. Then, using trim.seqs (with qwindowaverage=35 and qwindowsize=50), I removed bad-quality reads. After that, I ran unique.seqs and aligned the unique sequences. I found something like this:
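For reference, the steps just described correspond to commands along these lines (the input file names and the alignment reference here are assumptions, not the actual files used):

```
mothur > fastq.info(fastq=merge.fastq)
mothur > trim.seqs(fasta=merge.fasta, qfile=merge.qual, qwindowaverage=35, qwindowsize=50)
mothur > unique.seqs(fasta=merge.trim.fasta)
mothur > align.seqs(fasta=merge.trim.unique.fasta, reference=silva.v4.fasta)
```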
mothur > summary.seqs(fasta=merge.unique.align, count=merge.unique.count_table)
Using 16 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1968 4049 57 0 3 560722
25%-tile: 1968 9921 159 0 4 5607220
Median: 1968 10674 236 0 4 11214439
75%-tile: 1968 11546 251 0 5 16821658
97.5%-tile: 1968 11549 251 0 6 21868155
Maximum: 13425 13425 251 0 40 22428876
Mean: 2090 10108 203 0 4
# of unique seqs: 3472322
total # of seqs: 22428876
1. While using the following command, I lost 51% of all sequences:
mothur > screen.seqs(fasta=merge.unique.align, count=merge.unique.count_table, start=1968, end=10674, maxhomop=8)
mothur > summary.seqs(fasta=merge.unique.good.align, count=merge.unique.good.count_table)
Using 16 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1967 10674 229 0 3 1
2.5%-tile: 1968 10674 238 0 3 273825
25%-tile: 1968 11546 251 0 4 2738249
Median: 1968 11546 251 0 4 5476498
75%-tile: 1968 11546 251 0 5 8214747
97.5%-tile: 1968 11549 251 0 6 10679171
Maximum: 1968 13406 251 0 8 10952995
Mean: 1967 11379 249 0 4
# of unique seqs: 1294167
total # of seqs: 10952995
Now filter.seqs with the following command gives the following summary:
mothur > filter.seqs(fasta=merge.unique.good.align, vertical=T, trump=.)
mothur > summary.seqs(fasta=merge.unique.good.filter.fasta, count=merge.unique.good.count_table)
Using 16 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 433 218 0 3 1
2.5%-tile: 1 435 237 0 3 273825
25%-tile: 1 435 238 0 4 2738249
Median: 1 435 238 0 4 5476498
75%-tile: 1 435 238 0 5 8214747
97.5%-tile: 1 435 238 0 6 10679171
Maximum: 1 435 246 0 8 10952995
Mean: 1 434 237 0 4
# of unique seqs: 1294167
total # of seqs: 10952995
2. I am worried that the median number of bases dropped from 251 to 238. Should this be a problem? Also, rerunning unique.seqs reduced the number of unique sequences from 1294167 to 914461.
3. I am worried that the total number of sequences dropped from 22428876 to 10952995, a 51% reduction. I suspect I will end up removing even more sequences after the chimera removal step.
Please comment on these confusions and suggest whether I should change anything in my protocol.
Many thanks,
1. While using the following command, I lost 51% of all sequences:
I suspect you had a bad sequencing run. It is what it is, so I’d just move on with the data that you have.
2. I am worried that the median number of bases dropped from 251 to 238. Should this be a problem?
You will probably stay at 251 if, in screen.seqs, you use end=11546 instead of end=10674.
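In other words, rerunning the same screen.seqs command with the larger end coordinate:

```
mothur > screen.seqs(fasta=merge.unique.align, count=merge.unique.count_table, start=1968, end=11546, maxhomop=8)
```

Note that this keeps the full-length 251-base reads, but it will discard any sequence that ends before position 11546, so the total number of sequences retained may drop further.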
3. I am worried that the total number of sequences dropped from 22428876 to 10952995, a 51% reduction. I suspect I will end up removing even more sequences after the chimera removal step.
You will probably remove more sequences after the chimera checking.
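If it helps to plan ahead, a typical chimera-removal step in mothur looks something like this (the file names are placeholders that would follow from your earlier steps; chimera.vsearch flags the chimeras and remove.seqs drops them):

```
mothur > chimera.vsearch(fasta=merge.unique.good.filter.fasta, count=merge.unique.good.count_table, dereplicate=t)
mothur > remove.seqs(fasta=current, accnos=current)
```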
Is there a reason that you don’t have paired sequence reads? That is probably why the sequence quality is pretty low. If you had better data, you’d retain more reads.
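If paired reads are available, merging them with make.contigs before any quality screening usually gives much better per-base quality than trimming single reads (the file names here are hypothetical):

```
mothur > make.contigs(ffastq=sample_R1.fastq, rfastq=sample_R2.fastq)
```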
Pat