Align.seqs - queries.

Hello,

I have taken one of my samples through the MiSeq SOP, and have some queries surrounding the align.seqs command.

I am unsure as to how to map my V3 product to the reference, so I did not specify start and end parameters before running align.seqs. Below is my summary table:

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align,count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)


Using 1 processors.

                  Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0         0         0            0             1       1
2.5%-tile:        2         7450    156         0            4       23021
25%-tile:         2         7450    156         0            4       230205
Median:          2         7450    156         0            4       460409
75%-tile:         2         7450    156         0            5       690613
97.5%-tile:      2         7452    156         0            5       897796
Maximum:      21228   21228   200        0            27      920816
Mean:         9.64788 7450.93 155.908   0       4.42693
# of unique seqs:       131351
total # of seqs:        920816

Output File Names:
1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.summary

I have a couple of questions:

(a) What does the start and end position in the above table relate to? I assumed that it meant the start and end position in the alignment, but obviously this doesn’t add up to ~ 156bp. Apologies for this potentially “stupid” question.
(b) I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?

In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).

Thanks for any advice provided,

Jo

(a) What does the start and end position in the above table relate to? I assumed that it meant the start and end position in the alignment, but obviously this doesn’t add up to ~ 156bp. Apologies for this potentially “stupid” question.

It is in the alignment. Remember that there are gaps in the alignment and so that is why End-Start > Nbases.

(b) I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?

Not sure what you mean. Can you post the pre and post summary.seqs output? There is nothing in align.seqs that throws out reads.

In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).

I’m afraid we need more specifics. summary.seqs output, exact commands you’re running, and version of MiSeq.

Hello. Thank you for your reply.

I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?

Since I do not know where my amplicon sits in the alignment, I ran the following command for align.seqs:

align.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.fasta, reference=…/bin/example_data/silva.bacteria/silva.bacteria.pcr.fasta)

The summary.seqs output prior to and after running the above command are as follows:

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)


Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       150     150     0       3       1
2.5%-tile:      1       177     177     0       4       47798
25%-tile:       1       177     177     0       4       477973
Median:         1       177     177     0       4       955945
75%-tile:       1       177     177     0       5       1433917
97.5%-tile:     1       177     177     0       5       1864091
Maximum:        1       200     200     0       27      1911888
Mean:   1       176.997 176.997 0       4.43096
# of unique seqs:       655372
total # of seqs:        1911888
mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)


Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      2       7450    156     0       4       23021
25%-tile:       2       7450    156     0       4       230205
Median:         2       7450    156     0       4       460409
75%-tile:       2       7450    156     0       5       690613
97.5%-tile:     2       7452    156     0       5       897796
Maximum:        21228   21228   200     0       27      920816
Mean:   9.64788 7450.93 155.908 0       4.42693
# of unique seqs:       131351
total # of seqs:        920816

In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).

As previously mentioned, I am concerned that I am losing too many reads throughout the quality control process. I will try to highlight this below:

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.fasta)

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       150     150     0       3       1
2.5%-tile:      1       177     177     0       4       75485
25%-tile:       1       177     177     0       4       754849
Median:         1       177     177     0       5       1509698
75%-tile:       1       265     265     0       10      2264547
97.5%-tile:     1       269     269     3       15      2943911
Maximum:        1       301     300     117     150     3019395
Mean:   1       206.669 206.668 0.285542        6.71277
# of Seqs:      3019395
screen.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.fasta, group=1169N0001_ATCACG_L001_R1_001.contigs.groups, maxambig=0, maxlength=200)
unique.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.fasta)
count.seqs(name=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.names, group=1169N0001_ATCACG_L001_R1_001.contigs.good.groups)

See previous summary.seqs tables for prior and after align.seqs output.

screen.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align, 
count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table, summary=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.summary, start=2, end=7452, maxhomop=8
summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.align, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.good.count_table)

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        2       7452    155     0       3       1
2.5%-tile:      2       7452    156     0       4       625
25%-tile:       2       7452    156     0       4       6249
Median:         2       7452    156     0       4       12498
75%-tile:       2       7452    156     0       5       18746
97.5%-tile:     2       7452    157     0       5       24370
Maximum:        2       8547    179     0       8       24994
Mean:   2       7452.89 156.05  0       4.42026
# of unique seqs:       5247
total # of seqs:        24994

filter.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.align, vertical=T, trump=.)

unique.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.good.count_table)

pre.cluster(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.count_table, diffs=2)

chimera.uchime(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

remove.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)

summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, processors=3)

Using 3 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       209     154     0       3       1
2.5%-tile:      1       211     156     0       4       625
25%-tile:       1       211     156     0       4       6247
Median:         1       211     156     0       4       12493
75%-tile:       1       211     156     0       5       18739
97.5%-tile:     1       211     157     0       5       24361
Maximum:        1       211     166     0       8       24985
Mean:   1       211     156.028 0       4.41921
# of unique seqs:       686
total # of seqs:        24985

At this time, I’m afraid I am not aware of the version of MiSeq used. I will find out soon if this is essential.

Thanks in advance for your help, much appreciated.

Kind regards,

Jo

Could you post the summary.seqs results after the align.seqs command? Also, could you post the screen.seqs command you ran with the aligned file?

Apologies - I have just realised that the problems may be arising due to unsuccessful removal of reverse primer in my sequences. Thank you for your assistance to date.