How to decide maxlength?

Hi mothur community,

Many thanks for maintaining the forum and the tool. I’m new to mothur and to 16S data, so kindly bear with me and my naive questions. :oops:
I have Illumina MiSeq, dual-indexed, paired-end, demultiplexed reads covering the V3-V4 region. They do not have barcodes in them, and I’m yet to receive the linker/primer information. Read length is 301.
I took a step ahead with one sample: created the stability file and ran mothur. So far so good.
Command ran:

mothur "#summary.seqs(fasta=stability.trim.contigs.fasta)"

            Start  End      NBases   Ambigs    Polymer   NumSeqs
Minimum:    1      35       35       0         3         1
2.5%-tile:  1      443      443      0         4         2597
25%-tile:   1      446      446      0         4         25961
Median:     1      464      464      2         5         51922
75%-tile:   1      464      464      6         5         77883
97.5%-tile: 1      598      598      22        7         101247
Maximum:    1      602      602      45        300       103843
Mean:       1      463.875  463.875  4.49172   4.84329

# of Seqs: 103843

Clearly the 602 bp contigs didn’t actually assemble (301 + 301 with no overlap).

screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)

Questions:

  1. How do I decide the max length parameter I should be using in next step to filter out my sequences?
  2. Does Maximum: 103843 (NumSeqs) mean I have 10k-ish sequences with length 602?
  3. Has mothur added a feature to take in .fastq.gz files?

I’ll be grateful to you for helping me proceed.

  1. How do I decide the max length parameter I should be using in next step to filter out my sequences?

I’d take the silva dataset and run it through pcr.seqs with your primers and then run that output through summary.seqs. What type of variation do you see?
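In case it helps to see the idea, here is a toy Python sketch (not mothur code; the primers and reference sequences below are made up) of what pcr.seqs does conceptually: extract the region between your forward and reverse primers from each reference sequence, then look at the length spread of the trimmed fragments.

```python
# Toy in-silico PCR: trim each reference to the region between the primers,
# then inspect the length variation across references. All sequences and
# primers here are invented for illustration only.
def insilico_pcr(seq, fwd, rev_rc):
    """Return the region between the forward primer and the reverse primer
    (given already reverse-complemented, so it matches the top strand),
    or None if either primer is absent."""
    i = seq.find(fwd)
    j = seq.find(rev_rc, i + len(fwd)) if i != -1 else -1
    if i == -1 or j == -1:
        return None
    return seq[i + len(fwd):j]

refs = ["AAACCTACGGGTTTTTGGATTAGATACCCAAA",
        "GGGCCTACGGGAAAAAAAAGGATTAGATACCCTT"]
fragments = [insilico_pcr(s, "CCTACGGG", "GGATTAGATACCC") for s in refs]
print(sorted(len(f) for f in fragments if f))  # -> [5, 8]: the length spread
```

The length distribution you get from the real reference alignment is what tells you how long genuine amplicons from your region can be.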

  2. Does Maximum: 103843 (NumSeqs) mean I have 10k-ish sequences with length 602?

Actually, you have 100k-ish sequences that are 602 nt or shorter.
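The NumSeqs column reads as a cumulative rank: the row for the p-th percentile reports how many sequences are at or below that length. A quick Python check of that reading (an assumption on my part, but it lines up with the numbers in the first summary above):

```python
import math

# Assumed reading of summary.seqs output: the NumSeqs value on the p-th
# percentile row is ceil(p/100 * total). Checked against the totals from
# the first summary in this thread.
total = 103843  # "# of Seqs" from the first summary

def rank_at(percentile, total):
    return math.ceil(percentile / 100 * total)

for p in (2.5, 25, 50, 75, 97.5):
    print(p, rank_at(p, total))
# 2.5 -> 2597, 50 -> 51922, 97.5 -> 101247, matching the table
```

So the Maximum row's NumSeqs is simply the total sequence count, not the number of 602 nt sequences.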

  3. Has mothur added a feature to take in .fastq.gz files?

This will be in the next release.

Also, you may want to see this, if you’re new to this stuff and aren’t sequencing the V4 region…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Pat

Hi Dr. Pat,

Thanks for your reply. Kindly bear with me while I clear up my doubts.

The V3-V4 region (~460 bp) has been sequenced, with a 5% PhiX spike-in, following http://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/16s/16s-metagenomic-library-prep-guide-15044223-b.pdf

As suggested, I went ahead with pcr.seqs using my oligos file. The data has already been demultiplexed, so I didn’t provide barcodes in it. I took the bacterial database from http://www.mothur.org/wiki/Silva_reference_files

Read length is 301, and the data quality starts dropping after ~230 bp. Sigh.

I’ve started with 4 samples.
mothur "#make.contigs(file=stability.files, oligos=oligos_sample.txt, processors=8)"

Summary of mothur "#summary.seqs(fasta=stability.trim.contigs.fasta, processors=8)"

            Start  End      NBases   Ambigs    Polymer   NumSeqs
Minimum:    1      39       39       0         3         1
2.5%-tile:  1      408      408      0         4         6515
25%-tile:   1      412      412      1         4         65148
Median:     1      428      428      2         5         130296
75%-tile:   1      429      429      7         6         195443
97.5%-tile: 1      551      551      22        8         254076
Maximum:    1      566      566      45        113       260590
Mean:       1      431.174  431.174  4.68189   5.04415

# of Seqs: 260590

mothur "#pcr.seqs(fasta=silva.bacteria.fasta, oligos=oligos_sample.txt, processors=8)"

Summary of mothur "#summary.seqs(fasta=silva.bacteria.pcr.fasta, processors=8)"

            Start  End      NBases   Ambigs    Polymer   NumSeqs
Minimum:    6424   23440    384      0         3         1
2.5%-tile:  6426   23443    404      0         4         343
25%-tile:   6426   23443    407      0         4         3421
Median:     6426   23443    427      0         5         6842
75%-tile:   6426   23443    429      0         5         10262
97.5%-tile: 6426   23443    430      1         6         13340
Maximum:    6428   23490    471      5         9         13682
Mean:       6426   23443    419.509  0.093042  4.85901

# of Seqs: 13682

75% of the reference sequences from pcr.seqs are 429 bp or shorter.
I hope I ran pcr.seqs correctly.

I’d request your help in understanding how the above stats impact downstream analysis, but hopefully that will be once and for all. Based on these numbers, how do I pick my maxlength parameter so as not to include a huge amount of garbage, yet not discard reads that might be important? I’ve gone through the explanation in the MiSeq SOP, but at a broader level I still lack the understanding.

“Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime.” :geek:

Your problem really is that the reads do not fully overlap and that the V3 chemistry is garbage:

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Both of these problems are causing the large number of Ns.

Regardless, I would probably use something like 490 as your maximum length.
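As a rough sketch of where a number like 490 can come from (the cushion size here is my assumption, not something mothur prescribes): the primer-trimmed SILVA reference above spans 384-471 bp, so the longest expected amplicon plus a little slack gives a sensible maxlength.

```python
# Sketch of the maxlength reasoning. ref_min/ref_max come from the
# Minimum/Maximum NBases rows of the pcr.seqs summary above; the cushion
# is a hypothetical allowance for real-world length variation.
ref_min, ref_max = 384, 471
cushion = 20                     # assumed slack, not a mothur default
maxlength = ref_max + cushion
print(maxlength)                 # in the neighborhood of 490
```

Anything much longer than the reference amplicons is more likely a failed assembly than a real insert.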

The upshot of all of this is that with the high error rate data that you have, the number of OTUs will be artificially inflated and communities will look more different from each other than they really are.
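A toy illustration (deliberately simplified, and not mothur's error model) of that inflation: start from many copies of one true sequence, sprinkle random substitutions at different rates, and count how many distinct sequences come out.

```python
import random

# Many copies of ONE true 240 nt template, plus random substitutions,
# come out as many distinct sequences -- which downstream clustering
# can mistake for extra diversity. Rates and template are made up.
random.seed(42)
true_seq = "ACGT" * 60
bases = "ACGT"

def add_errors(seq, rate):
    """Substitute each base with probability `rate` (the redrawn base may
    coincide with the original, so the effective change rate is lower)."""
    return "".join(random.choice(bases) if random.random() < rate else b
                   for b in seq)

for rate in (0.0, 0.01, 0.05):
    distinct = len({add_errors(true_seq, rate) for _ in range(1000)})
    print(f"error rate {rate}: {distinct} distinct sequences from one template")
```

With no errors there is exactly one distinct sequence; as the per-base error rate grows, a single template fans out into hundreds of apparent variants.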

Pat

Hi Dr. Pat,

Many thanks for the guidelines, and sharing vital points on V3 chemistry. :slight_smile: