Hi there!!!
Including all the screening and chimera removal steps, what is the maximum percentage of reads that it is acceptable to remove?
Thanks,
Deep
When sequencing the V4 region with 2x250 reads, removing 10-30% of all sequences with screen.seqs would be reasonable, and losing 5-25% of all sequences as chimeras would be typical. If your reads don’t fully overlap, I would expect you to lose more sequences. Also, if you are looking at the fraction of unique sequences rather than total sequences, I’d expect the percentages to be higher.
Pat
I have a couple of queries/confusions, which I have numbered below.
Currently, I am using unpaired Illumina sequence data (V4 region). First I ran fastq.info to get the fasta and quality files. Then, using trim.seqs (with qwindowaverage=35 and qwindowsize=50), I removed bad-quality reads. After that, I ran unique.seqs and aligned the unique sequences. I found something like this:
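For reference, the steps just described correspond to commands along these lines (the input file names and the alignment reference here are assumptions, not the actual files used):

```
mothur > fastq.info(fastq=merge.fastq)
mothur > trim.seqs(fasta=merge.fasta, qfile=merge.qual, qwindowaverage=35, qwindowsize=50)
mothur > unique.seqs(fasta=merge.trim.fasta)
mothur > align.seqs(fasta=merge.trim.unique.fasta, reference=silva.v4.fasta)
```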
mothur > summary.seqs(fasta=merge.unique.align, count=merge.unique.count_table)
Using 16 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1968 4049 57 0 3 560722
25%-tile: 1968 9921 159 0 4 5607220
Median: 1968 10674 236 0 4 11214439
75%-tile: 1968 11546 251 0 5 16821658
97.5%-tile: 1968 11549 251 0 6 21868155
Maximum: 13425 13425 251 0 40 22428876
Mean: 2090 10108 203 0 4
# of unique seqs: 3472322
total # of seqs: 22428876
1. While using the following command, I lost 51% of all sequences:
mothur > screen.seqs(fasta=merge.unique.align, count=merge.unique.count_table, start=1968, end=10674, maxhomop=8)
mothur > summary.seqs(fasta=merge.unique.good.align, count=merge.unique.good.count_table)
Using 16 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1967 10674 229 0 3 1
2.5%-tile: 1968 10674 238 0 3 273825
25%-tile: 1968 11546 251 0 4 2738249
Median: 1968 11546 251 0 4 5476498
75%-tile: 1968 11546 251 0 5 8214747
97.5%-tile: 1968 11549 251 0 6 10679171
Maximum: 1968 13406 251 0 8 10952995
Mean: 1967 11379 249 0 4
# of unique seqs: 1294167
total # of seqs: 10952995
Now filter.seqs with the following command gives the following summary:
mothur > filter.seqs(fasta=merge.unique.good.align, vertical=T, trump=.)
mothur > summary.seqs(fasta=merge.unique.good.filter.fasta, count=merge.unique.good.count_table)
Using 16 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 433 218 0 3 1
2.5%-tile: 1 435 237 0 3 273825
25%-tile: 1 435 238 0 4 2738249
Median: 1 435 238 0 4 5476498
75%-tile: 1 435 238 0 5 8214747
97.5%-tile: 1 435 238 0 6 10679171
Maximum: 1 435 246 0 8 10952995
Mean: 1 434 237 0 4
# of unique seqs: 1294167
total # of seqs: 10952995
2. I am worried that the median number of bases dropped from 251 to 238. Should this be a problem? Also, rerunning unique.seqs reduced the number of unique sequences from 1294167 to 914461.
3. I am worried that the total number of sequences dropped from 22428876 to 10952995, a 51% reduction. I suspect I will end up removing even more sequences after the chimera removal step.
Please comment on these confusions and suggest whether I should change anything in my protocol.
Many thanks,
1. While using the following command, I lost 51% of all sequences:
I suspect you had a bad sequencing run. It is what it is, so I’d just move on with the data that you have.
2. I am worried that the median number of bases dropped from 251 to 238. Should this be a problem?
You will probably stay at 251 if, in screen.seqs, you use end=11546 instead of end=10674.
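In other words, rerunning the same screen.seqs command with the larger end coordinate:

```
mothur > screen.seqs(fasta=merge.unique.align, count=merge.unique.count_table, start=1968, end=11546, maxhomop=8)
```

Note that this keeps the full-length 251-base reads, but it will discard any sequence that ends before position 11546, so the total number of sequences retained may drop further.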
3. I am worried that the total number of sequences dropped from 22428876 to 10952995, a 51% reduction. I suspect I will end up removing even more sequences after the chimera removal step.
You will probably remove more sequences after the chimera checking.
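If it helps to plan ahead, a typical chimera-removal step in mothur looks something like this (the file names are placeholders that would follow from your earlier steps; chimera.vsearch flags the chimeras and remove.seqs drops them):

```
mothur > chimera.vsearch(fasta=merge.unique.good.filter.fasta, count=merge.unique.good.count_table, dereplicate=t)
mothur > remove.seqs(fasta=current, accnos=current)
```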
Is there a reason that you don’t have paired sequence reads? That is probably why the sequence quality is pretty low. If you had better data, you’d retain more reads.
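If paired reads are available, merging them with make.contigs before any quality screening usually gives much better per-base quality than trimming single reads (the file names here are hypothetical):

```
mothur > make.contigs(ffastq=sample_R1.fastq, rfastq=sample_R2.fastq)
```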
Pat