Interpretation of mothur summary.seqs

Hello everyone,
As I am utilizing mothur for analyzing my 16S rRNA data related to V3 and V4 regions, I’m eager to hear your thoughts on the summary results below.

I am particularly concerned about the minimum length mentioned, which is 13 bp. Would it be sensible to set the ‘minlength’ parameter in the ‘makecontigs’ function as minlength = 400? or I could proceed screen sequences with Ambigs = 0 and without setting minlenght? If I opt for a minlength of 400, I might overlook some sequences, which leaves me uncertain about the best course of action. Could you explain what is the meaning of below output, does it mean that we just one contige with 13 bp?!

Your insights would be greatly appreciated.

mothur > summary.seqs(fasta=stability.trim.contigs.pick.fasta, count=stability.contigs.count_table)

Using 56 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	13	13	0	2	1
2.5%-tile:	1	299	299	0	4	106575
25%-tile:	1	444	444	0	4	1065744
Median: 	1	464	464	0	5	2131488
75%-tile:	1	469	469	0	6	3197231
97.5%-tile:	1	470	470	7	7	4156400
Maximum:	1	497	497	64	220	4262974
Mean:	1	448	448	0	5
# of unique seqs:	4262974
total # of seqs:	4262974

It took 162 secs to summarize 4262974 sequences.

Hi there,

The 13 means that the shortest sequence you have has 13 nt. There may only be 1, there may be many. 2.5% of your sequences are 299 nt and shorter.

Are you trimming your sequences before bringing them into mothur? If you are, don’t do that.

You will likely want to extract the V3-V4 region from a reference database and get a sense of the minimum length you can expect. You can find instructions on doing this type of thing here.

When you run screen.seqs, you’ll probably want to do something like…

screen.seqs(fasta=stability.trim.contigs.pick.fasta, count=stability.contigs.count_table, maxambig=0, maxhomop=8, minlength=XXX)

Where XXX is the minimum length you decide upon. You may also want to set a value for maxlength.

Finally, I’d encourage you to consult this blog post on problems with sequencing regions like V3-V4

Pat

Dear Pat,
I greatly appreciate your prompt response. Indeed, I utilized the clean reads with adaptors and primers removed by the respective companies. This analysis can be a unique one as I aim to compare data from two different companies, both targeting the V3 and V4 regions. I intend to harmonize the analysis using a unified protocol.

Your insights are invaluable to me, and I am also facing another challenge that I’d like to discuss with you. Following the alignment, here is a summary of the output:

mothur > summary.seqs(fasta=stability.trim.contigs.pick.good.unique.align, count=stability.trim.contigs.pick.good.count_table)

Using 56 processors.

        Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        0       0       0       0       1       1
2.5%-tile:      12      17122   404     0       4       98610
25%-tile:       12      18996   445     0       4       986098
Median:         12      18996   468     0       5       1972196
75%-tile:       12      18996   469     0       5       2958293
97.5%-tile:     106     18996   470     0       7       3845781
Maximum:        19118   19118   492     0       8       3944390
Mean:   202     18775   451     0       4
# of unique seqs:      1654272
total # of seqs:       3944390

It took 97 secs to summarize 3944390 sequences.

Output File Names:
stability.trim.contigs.pick.good.unique.summary
However, I encountered an error when running pre.cluster():
Processing group BB.M.G3:
[ERROR]: diffs is greater than your sequence length.

Deconvoluting count table results...
[ERROR]: stability.trim.contigs.pick.good.unique.good.filter.unique.precluster.count_table.temp is blank. Please correct.
It took 0 secs to merge 0 sequences group data.

Upon investigating, I found that this error might be related to the start and end values used in the previous step when I ran screen.seqs(). For instance, when I executed screen.seqs() with the parameters start=12 and end=18996, I noticed that samples from one company were excluded.

To address this issue, I plan to run screen.seqs() with adjusted parameters, specifically using start=106 and end=18996, to ascertain if this resolves the problem. However, I seek your guidance on determining the optimal start and end values that would encompass both sets of sequences. Additionally, I am curious as to why selecting start=12 resulted in the exclusion of data from one company. Would you think to start = 106 would be appropriate?

Your insights and expertise in this matter would be greatly appreciated.

Thank you once again for your assistance.

Hi there,

First off… I would super strongly suggest getting the raw data from the companies. We find that trimming sequences before assembly always results in an inferior result in terms of sequencing error rate.

Second… it’s possible the two companies used different primers that start at different positions or that the sequencing runs/bioinformatics were different enough to exclude one of the vendors. Different primers could start/end at different positions. It seems that there are likely too many variables here. Did you give the same library to the companies or did you give them individual DNA aliquots? PCR, library generation, sequencing, and trimming conditions could all have significant impacts on quality. Without holding 3 of those constant, it’s hard to know which factor is causing what you’re seeing.

Thanks,
Pat

Dear Pat,

Thank you sincerely for your recommendation. Yes, after I adjusted the raw data from the company, which hadn’t been trimmed, the length of the shorter sequences increased and reached 246.

mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

Using 56 processors.

    Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:    1       246     246     0       2       1
2.5%-tile:  1       270     270     0       4       203779
25%-tile:   1       444     444     0       4       2037783
Median:     1       468     468     0       5       4075565
75%-tile:   1       470     470     0       6       6113347
97.5%-tile: 1       479     479     20      13      7947351
Maximum:    1       499     499     84      244     8151129
Mean:       1       440     440     1       5
# of Seqs: 8151129

For the next step, I opted for the ‘optimize’ option and encountered no errors:

mothur > screen.seqs(fasta=stability.trim.contigs.good.pick.unique.align, count=stability.trim.contigs.good.pick.count_table, optimize=start-end, criteria=90)

Consequently, I proceeded with the analysis:

mothur > summary.seqs(fasta=stability.trim.contigs.good.pick.unique.good.align, count=stability.trim.contigs.good.pick.good.count_table)

Using 56 processors.

    Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:    2       18996   333     0       3       1
2.5%-tile:  8       18996   444     0       4       145850
25%-tile:   8       18996   453     0       4       1458494
Median:     12      18996   469     0       5       2916987
75%-tile:   12      19116   477     0       6       4375480
97.5%-tile: 12      19118   478     0       6       5688124
Maximum:    3642    19118   499     0       8       5833973
Mean:       12      19043   464     0       4
# of unique seqs:      3075028
total # of seqs:       5833973

Below is the summary of sequences before dis.seq(), which is currently running. Please confirm if this process was executed correctly:

mothur > summary.seqs(fasta=stability.trim.contigs.good.pick.unique.good.filter.unique.precluster.denovo.vsearch.pick.fasta, count=stability.trim.contigs.good.pick.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table)

Using 56 processors.

    Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:    1       1749    285     0       3       1
2.5%-tile:  1       1770    330     0       4       127448
25%-tile:   1       1770    332     0       4       1274472
Median:     1       1770    342     0       4       2548944
75%-tile:   1       1770    343     0       5       3823416
97.5%-tile: 1       1770    343     0       6       4970440
Maximum:    179     1770    423     0       8       5097887
Mean:       1       1769    338     0       4
# of unique seqs:      1153966
total # of seqs:       5097887

Indeed, the samples sent to different companies were not from similar aliquots and were collected in different seasons. While I agree there may be some variables such as using different primers, I thought that by combining the data and analyzing them together, I could reduce these variables to gain insight into the effect of season.

Looking forward to your feedback,
Best

Hi Babak -

Something to look at woud be setting your start and end position in screen.seqs to 12 and 18996, respectively. It appears you have something that snuck through that has a start position of 179 when everythign else was at 1.

Pat

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.