Hello. Thank you for your reply.
I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?
Since I do not know where my amplicon sits in the alignment, I ran the following command for align.seqs:
align.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.fasta, reference=…/bin/example_data/silva.bacteria/silva.bacteria.pcr.fasta)
The summary.seqs output prior to and after running the above command are as follows:
mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 150 150 0 3 1
2.5%-tile: 1 177 177 0 4 47798
25%-tile: 1 177 177 0 4 477973
Median: 1 177 177 0 4 955945
75%-tile: 1 177 177 0 5 1433917
97.5%-tile: 1 177 177 0 5 1864091
Maximum: 1 200 200 0 27 1911888
Mean: 1 176.997 176.997 0 4.43096
# of unique seqs: 655372
total # of seqs: 1911888
mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 2 7450 156 0 4 23021
25%-tile: 2 7450 156 0 4 230205
Median: 2 7450 156 0 4 460409
75%-tile: 2 7450 156 0 5 690613
97.5%-tile: 2 7452 156 0 5 897796
Maximum: 21228 21228 200 0 27 920816
Mean: 9.64788 7450.93 155.908 0 4.42693
# of unique seqs: 131351
total # of seqs: 920816
In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).
As previously mentioned, I am concerned that I am losing too many reads throughout the quality control process. I will try to highlight this below:
mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.fasta)
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 150 150 0 3 1
2.5%-tile: 1 177 177 0 4 75485
25%-tile: 1 177 177 0 4 754849
Median: 1 177 177 0 5 1509698
75%-tile: 1 265 265 0 10 2264547
97.5%-tile: 1 269 269 3 15 2943911
Maximum: 1 301 300 117 150 3019395
Mean: 1 206.669 206.668 0.285542 6.71277
# of Seqs: 3019395
screen.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.fasta, group=1169N0001_ATCACG_L001_R1_001.contigs.groups, maxambig=0, maxlength=200)
unique.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.fasta)
count.seqs(name=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.names, group=1169N0001_ATCACG_L001_R1_001.contigs.good.groups)
See previous summary.seqs tables for prior and after align.seqs output.
screen.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align,
count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table, summary=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.summary, start=2, end=7452, maxhomop=8
summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.align, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.good.count_table)
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 2 7452 155 0 3 1
2.5%-tile: 2 7452 156 0 4 625
25%-tile: 2 7452 156 0 4 6249
Median: 2 7452 156 0 4 12498
75%-tile: 2 7452 156 0 5 18746
97.5%-tile: 2 7452 157 0 5 24370
Maximum: 2 8547 179 0 8 24994
Mean: 2 7452.89 156.05 0 4.42026
# of unique seqs: 5247
total # of seqs: 24994
filter.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.align, vertical=T, trump=.)
unique.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.good.count_table)
pre.cluster(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.count_table, diffs=2)
chimera.uchime(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
remove.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)
summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, processors=3)
Using 3 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 209 154 0 3 1
2.5%-tile: 1 211 156 0 4 625
25%-tile: 1 211 156 0 4 6247
Median: 1 211 156 0 4 12493
75%-tile: 1 211 156 0 5 18739
97.5%-tile: 1 211 157 0 5 24361
Maximum: 1 211 166 0 8 24985
Mean: 1 211 156.028 0 4.41921
# of unique seqs: 686
total # of seqs: 24985
At this time, I’m afraid I am not aware of the version of MiSeq used. I will find out soon if this is essential.
Thanks in advance for your help, much appreciated.
Kind regards,
Jo