What is the acceptable limit of sequence removal?

Hi there!
Including all the screening and chimera-removal steps, what is the maximum percentage of reads that is acceptable to remove?
Thanks,
Deep

When sequencing the V4 region with 2x250 reads, removing 10-30% of all sequences with screen.seqs would be reasonable, and losing 5-25% of all sequences as chimeras would be typical. If your reads don't fully overlap, I would expect you to lose more sequences. Also, if you are looking at the fraction of unique sequences rather than total sequences, I'd expect the percentages to be higher too.
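To check your own run against the ranges above, the loss at each step is just a percentage of the totals reported by summary.seqs. Here is a minimal sketch (not part of mothur; the counts below are hypothetical examples) of that arithmetic:

```python
def percent_removed(before: int, after: int) -> float:
    """Percentage of sequences lost between two pipeline steps,
    based on the 'total # of seqs' lines from summary.seqs."""
    return 100.0 * (before - after) / before

# Hypothetical totals before and after screen.seqs:
total_before = 1_000_000
total_after = 750_000
loss = percent_removed(total_before, total_after)
print(f"screen.seqs removed {loss:.1f}% of total sequences")  # 25.0%
```

Comparing that number against the 10-30% (screening) and 5-25% (chimera) ranges gives a quick sanity check on a run.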

Pat


I have a couple of queries/confusions, which I describe below.
Currently, I am using Illumina unpaired sequence data (V4 region). First I ran fastq.info to get the fasta and quality files. Then I removed low-quality reads with trim.seqs (qwindowaverage=35, qwindowsize=50). After that, I aligned the unique sequences and got the following:

mothur > summary.seqs(fasta=merge.unique.align, count=merge.unique.count_table)

Using 16 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	0	0	0	0	1	1
2.5%-tile:	1968	4049	57	0	3	560722
25%-tile:	1968	9921	159	0	4	5607220
Median: 	1968	10674	236	0	4	11214439
75%-tile:	1968	11546	251	0	5	16821658
97.5%-tile:	1968	11549	251	0	6	21868155
Maximum:	13425	13425	251	0	40	22428876
Mean:	2090	10108	203	0	4
# of unique seqs:	3472322
total # of seqs:	22428876

1. While running the following command, I lost 51% of all sequences:

mothur > screen.seqs(fasta=merge.unique.align, count=merge.unique.count_table, start=1968, end=10674,  maxhomop=8)
mothur > summary.seqs(fasta=merge.unique.good.align, count=merge.unique.good.count_table)

Using 16 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1967	10674	229	0	3	1
2.5%-tile:	1968	10674	238	0	3	273825
25%-tile:	1968	11546	251	0	4	2738249
Median: 	1968	11546	251	0	4	5476498
75%-tile:	1968	11546	251	0	5	8214747
97.5%-tile:	1968	11549	251	0	6	10679171
Maximum:	1968	13406	251	0	8	10952995
Mean:	1967	11379	249	0	4
# of unique seqs:	1294167
total # of seqs:	10952995

Running filter.seqs with the following command gives this summary:

mothur > filter.seqs(fasta=merge.unique.good.align, vertical=T, trump=.)
mothur > summary.seqs(fasta=merge.unique.good.filter.fasta, count=merge.unique.good.count_table)

Using 16 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	433	218	0	3	1
2.5%-tile:	1	435	237	0	3	273825
25%-tile:	1	435	238	0	4	2738249
Median: 	1	435	238	0	4	5476498
75%-tile:	1	435	238	0	5	8214747
97.5%-tile:	1	435	238	0	6	10679171
Maximum:	1	435	246	0	8	10952995
Mean:	1	434	237	0	4
# of unique seqs:	1294167
total # of seqs:	10952995

2. I am worried that the median number of bases dropped from 251 to 238. Is this a problem? Also, running unique.seqs reduced the number of unique sequences from 1294167 to 914461.

3. I am worried that the total number of sequences dropped from 22428876 to 10952995, a 51% reduction, and I suspect I will lose even more at the chimera-removal step.
Please comment on these confusions and suggest whether I should change anything in my protocol.
Many Thanks
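The reductions described above can be verified with a few lines of arithmetic (a sketch outside mothur; the counts are taken directly from the summary.seqs output above):

```python
# Totals from the summary.seqs output in this post:
total_before_screen = 22428876   # total # of seqs before screen.seqs
total_after_screen = 10952995    # total # of seqs after screen.seqs
unique_before = 1294167          # unique seqs after filter.seqs
unique_after = 914461            # unique seqs after unique.seqs

total_loss = 100.0 * (total_before_screen - total_after_screen) / total_before_screen
unique_loss = 100.0 * (unique_before - unique_after) / unique_before
print(f"total sequences removed by screen.seqs: {total_loss:.1f}%")   # ~51.2%
print(f"unique sequences merged by unique.seqs: {unique_loss:.1f}%")  # ~29.3%
```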

1. While running the following command, I lost 51% of all sequences:

I suspect you had a bad sequencing run. It is what it is, so I'd just move on with the data that you have.

2. I am worried that the median number of bases dropped from 251 to 238. Is this a problem?

You would probably have stayed at 251 if you had used end=11546 instead of end=10674 in screen.seqs.

3. I am worried that the total number of sequences dropped from 22428876 to 10952995, a 51% reduction, and I suspect I will lose even more at the chimera-removal step.

You will probably remove more sequences after chimera checking, yes.

Is there a reason that you don't have paired sequence reads? That is probably why the sequence quality is fairly low. With better data you would retain more reads.

Pat
