Align.seqs - queries.

jopollock · January 16, 2014, 4:06pm

Hello,

I have taken one of my samples through the MiSeq SOP, and have some queries surrounding the align.seqs command.

I am unsure as to how to map my V3 product to the reference, so I did not specify start and end parameters before running align.seqs. Below is my summary table:

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align,count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)


Using 1 processors.

                  Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0         0         0            0             1       1
2.5%-tile:        2         7450    156         0            4       23021
25%-tile:         2         7450    156         0            4       230205
Median:          2         7450    156         0            4       460409
75%-tile:         2         7450    156         0            5       690613
97.5%-tile:      2         7452    156         0            5       897796
Maximum:      21228   21228   200        0            27      920816
Mean:         9.64788 7450.93 155.908   0       4.42693
# of unique seqs:       131351
total # of seqs:        920816

Output File Names:
1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.summary

I have a couple of questions:

(a) What does the start and end position in the above table relate to? I assumed that it meant the start and end position in the alignment, but obviously this doesn’t add up to ~ 156bp. Apologies for this potentially “stupid” question.
(b) I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?

In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).

Thanks for any advice provided,

Jo

pschloss · January 20, 2014, 2:07pm

(a) What does the start and end position in the above table relate to? I assumed that it meant the start and end position in the alignment, but obviously this doesn’t add up to ~ 156bp. Apologies for this potentially “stupid” question.

It is in the alignment. Remember that there are gaps in the alignment and so that is why End-Start > Nbases.

(b) I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?

Not sure what you mean. Can you post the pre and post summary.seqs output? There is nothing in align.seqs that throws out reads.

In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).

I’m afraid we need more specifics. summary.seqs output, exact commands you’re running, and version of MiSeq.

jopollock · January 21, 2014, 8:51am

Hello. Thank you for your reply.

I lose around 80% of my reads during align.seqs. Would this having something to do with not narrowing down the reference alignment, or for some other reason unknown to me?

Since I do not know where my amplicon sits in the alignment, I ran the following command for align.seqs:

align.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.fasta, reference=…/bin/example_data/silva.bacteria/silva.bacteria.pcr.fasta)

The summary.seqs output prior to and after running the above command are as follows:

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)


Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       150     150     0       3       1
2.5%-tile:      1       177     177     0       4       47798
25%-tile:       1       177     177     0       4       477973
Median:         1       177     177     0       4       955945
75%-tile:       1       177     177     0       5       1433917
97.5%-tile:     1       177     177     0       5       1864091
Maximum:        1       200     200     0       27      1911888
Mean:   1       176.997 176.997 0       4.43096
# of unique seqs:       655372
total # of seqs:        1911888

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table)


Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      2       7450    156     0       4       23021
25%-tile:       2       7450    156     0       4       230205
Median:         2       7450    156     0       4       460409
75%-tile:       2       7450    156     0       5       690613
97.5%-tile:     2       7452    156     0       5       897796
Maximum:        21228   21228   200     0       27      920816
Mean:   9.64788 7450.93 155.908 0       4.42693
# of unique seqs:       131351
total # of seqs:        920816

In comparison with the number of sequences lost in the tutorial, I seem to be losing a huge number of sequences during the quality control part of the workflow. Obviously I fear that this will be skewing my final result (i.e. I have a huge abundance of Clostridium species in my digesta samples, which is unexpected).

As previously mentioned, I am concerned that I am losing too many reads throughout the quality control process. I will try to highlight this below:

mothur > summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.fasta)

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       150     150     0       3       1
2.5%-tile:      1       177     177     0       4       75485
25%-tile:       1       177     177     0       4       754849
Median:         1       177     177     0       5       1509698
75%-tile:       1       265     265     0       10      2264547
97.5%-tile:     1       269     269     3       15      2943911
Maximum:        1       301     300     117     150     3019395
Mean:   1       206.669 206.668 0.285542        6.71277
# of Seqs:      3019395

screen.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.fasta, group=1169N0001_ATCACG_L001_R1_001.contigs.groups, maxambig=0, maxlength=200)
unique.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.fasta)
count.seqs(name=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.names, group=1169N0001_ATCACG_L001_R1_001.contigs.good.groups)

See previous summary.seqs tables for prior and after align.seqs output.

screen.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.align, 
count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.count_table, summary=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.summary, start=2, end=7452, maxhomop=8
summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.align, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.good.count_table)

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        2       7452    155     0       3       1
2.5%-tile:      2       7452    156     0       4       625
25%-tile:       2       7452    156     0       4       6249
Median:         2       7452    156     0       4       12498
75%-tile:       2       7452    156     0       5       18746
97.5%-tile:     2       7452    157     0       5       24370
Maximum:        2       8547    179     0       8       24994
Mean:   2       7452.89 156.05  0       4.42026
# of unique seqs:       5247
total # of seqs:        24994

filter.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.align, vertical=T, trump=.)

unique.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.good.count_table)

pre.cluster(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.count_table, diffs=2)

chimera.uchime(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

remove.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)

summary.seqs(fasta=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=1169N0001_ATCACG_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, processors=3)

Using 3 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       209     154     0       3       1
2.5%-tile:      1       211     156     0       4       625
25%-tile:       1       211     156     0       4       6247
Median:         1       211     156     0       4       12493
75%-tile:       1       211     156     0       5       18739
97.5%-tile:     1       211     157     0       5       24361
Maximum:        1       211     166     0       8       24985
Mean:   1       211     156.028 0       4.41921
# of unique seqs:       686
total # of seqs:        24985

At this time, I’m afraid I am not aware of the version of MiSeq used. I will find out soon if this is essential.

Thanks in advance for your help, much appreciated.

Kind regards,

Jo

westcott · January 22, 2014, 1:46pm

Could you post the summary.seqs results after the align.seqs command? Also, could you post the screen.seqs command you ran with the aligned file?

jopollock · January 30, 2014, 1:37pm

Apologies - I have just realised that the problems may be arising due to unsuccessful removal of reverse primer in my sequences. Thank you for your assistance to date.

Topic		Replies	Views
strange behaviour align.seqs with samples from same run Commands in mothur	5	4982	November 20, 2014
align.seqs and no of bases Commands in mothur	5	2986	January 16, 2015
Start and End differs after remove.seqs compared to screen.seqs Commands in mothur	4	592	July 20, 2020
Align.seqs removing most basepairs Commands in mothur	1	2064	August 15, 2013
Blank file after summary.seqs mothur bugs	24	12067	July 2, 2015

Align.seqs - queries.

Related topics