Choosing length after first command line

Hello,

I am new to community analysis using NGS data and new to mothur. I am following the MiSeq SOP and it's great!
My questions are:

- I have paired-end reads from an Illumina MiSeq (2x250). My amplicons should be around 450 base pairs. How can I check that I got sequences from the hypervariable region I wanted? Secondly, when I started trying mothur yesterday and ran summary.seqs after the first step (i.e., make.contigs), I found the length of my sequences to be very variable:

              Start  End      NBases   Ambigs   Polymer  NumSeqs
Minimum:      1      261      261      0        3        1
2.5%-tile:    1      406      406      0        4        4345
25%-tile:     1      409      409      0        4        43443
Median:       1      411      411      1        5        86885
75%-tile:     1      412      412      2        5        130327
97.5%-tile:   1      415      415      11       6        169424
Maximum:      1      503      502      43       208      173768
Mean:         1      410.629  410.629  1.71477  4.74585

# of Seqs: 173768


So how should I choose the desired length in the very next step?

Looking forward to your advice.

Richa

Welcome to the mothur community! As you’ll see around the forum, we take a very negative view of sequencing 450 nt fragments using the V2 chemistry, since the reads will not fully overlap and you will get a much larger error rate and a lot more headaches than you would by sequencing the V4 region. I would suggest taking a 16S database like those we provide, trimming it to the region that you amplified using something like pcr.seqs, and then running the output through summary.seqs, which will show you the lengths to expect for that region. I can tell you that the sequence that is 503 bp is garbage that essentially only has a 1 nt overlap between the reads.

Pat
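
As a rough illustration of the overlap concern raised above, the following sketch estimates how many bases both reads of a pair cover for a given fragment length. It assumes both reads are exactly 250 nt (2x250) and that nothing has been trimmed; the function name `expected_overlap` is made up for this example and is not a mothur command.

```python
def expected_overlap(read_length: int, fragment_length: int) -> int:
    """Bases covered by both reads of a pair.

    Assumes two untrimmed reads of equal length sequenced from opposite
    ends of the fragment. A result of zero or below means the reads do
    not reach each other.
    """
    overlap = 2 * read_length - fragment_length
    # The overlap can never be longer than the fragment itself.
    return min(overlap, fragment_length)


if __name__ == "__main__":
    # 450 is the amplicon size given in the question; 411 is the median
    # contig length reported by summary.seqs after make.contigs.
    for fragment in (450, 411):
        print(fragment, expected_overlap(250, fragment))
```

Under the same assumptions, a fragment no longer than the read length would be covered by both reads along its whole length, which is the "fully overlap" situation described in the reply.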

Hello,

Thanks for the reply.

I used pcr.seqs to remove my oligos, and I still get sequences of (slightly) variable length.
Using 8 processors.

              Start  End      NBases   Ambigs   Polymer  NumSeqs
Minimum:      1      216      216      0        3        1
2.5%-tile:    1      367      367      0        4        268439
25%-tile:     1      370      370      0        4        2684389
Median:       1      372      372      0        5        5368778
75%-tile:     1      373      373      1        5        8053166
97.5%-tile:   1      376      376      9        6        10469116
Maximum:      1      462      462      56       222      10737554
Mean:         1      371.811  371.811  1.14461  4.71137

# of Seqs: 10737554

Following that, I screened on minimum and maximum length:

              Start  End      NBases   Ambigs   Polymer  NumSeqs
Minimum:      1      367      367      0        3        1
2.5%-tile:    1      368      368      0        4        259856
25%-tile:     1      370      370      0        4        2598558
Median:       1      372      372      0        5        5197115
75%-tile:     1      373      373      1        5        7795672
97.5%-tile:   1      376      376      9        6        10134374
Maximum:      1      376      376      32       113      10394229
Mean:         1      371.84   371.84   1.12945  4.71958

# of Seqs: 10394229

Output File Names:
stability.trim.contigs.trim.good.good.summary

I chose to take sequences of length 367 to 376.
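
To make the effect of that length window concrete, here is a minimal sketch of a length screen and of the share of sequences it removed, using the sequence counts from the two summaries above. It only illustrates the idea and is not how mothur implements its screening; `screen_by_length`, `percent_removed`, and the toy sequences are hypothetical.

```python
from typing import Dict


def screen_by_length(seqs: Dict[str, str], min_len: int, max_len: int) -> Dict[str, str]:
    """Keep only sequences whose length lies within [min_len, max_len]."""
    return {name: seq for name, seq in seqs.items() if min_len <= len(seq) <= max_len}


def percent_removed(before: int, after: int) -> float:
    """Percentage of sequences dropped between two sequence counts."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy sequences (made up) just to exercise the filter.
    toy = {"too_short": "A" * 216, "in_range": "A" * 372, "too_long": "A" * 462}
    print(sorted(screen_by_length(toy, 367, 376)))

    # Counts taken from the summaries before and after the length screen.
    print(round(percent_removed(10737554, 10394229), 2))
```

With those counts, the 367 to 376 window drops 343325 sequences, a little over 3% of the total.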

After the unique.seqs step, I am getting 7122169 sequences:

              Start  End      NBases   Ambigs   Polymer  NumSeqs
Minimum:      1      367      367      0        3        1
2.5%-tile:    1      368      368      0        4        178055
25%-tile:     1      370      370      0        4        1780543
Median:       1      372      372      0        5        3561085
75%-tile:     1      373      373      0        5        5341627
97.5%-tile:   1      376      376      0        6        6944115
Maximum:      1      376      376      0        70       7122169
Mean:         1      371.93   371.93   0        4.67692

# of unique seqs: 1166568

total # of seqs: 7122169

This number of unique sequences is too high, and I recently ran into a problem at the cluster.split step (mothur stopped proceeding). What should I do? Is it OK to take a minimum and maximum length of 367 to 376? Please advise.

My second question: during another mothur run, at the cluster.split step, I got the following message. Why was the cutoff changed from 0.15 to 0.05?

mothur > cluster.split(fasta=szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, taxonomy=szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

Using 4 processors.
Using splitmethod fasta.
Splitting the file…
/******************************************/…………………………………………………….

Clustering szn.trim.contigs.trim.good.good.good.unique.good.filter.unique.precluster.pick.pick.fasta.0.dist
Cutoff was 0.155 changed cutoff to 0.05
Cutoff was 0.155 changed cutoff to 0.05
It took 6306 seconds to cluster
Merging the clustered files…
It took 7 seconds to merge.



Looking forward to your reply.

Richa

Hello,

Just to add something to my question above.

I have some libraries with read quality scores below 28. I want to see how they are processed by mothur, on the assumption that not all the sequences in a library are bad; there must be many sequences of good quality. I understand that mothur does quality filtering very well, but I suspect these libraries are what is making the number of unique sequences so high. Is that the case?

Looking forward to your suggestions.

Richa

Please see http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/
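
For a sense of scale, the number of pairwise comparisons among N unique sequences is N(N-1)/2. The sketch below applies that to the 1166568 unique sequences reported above. It is only an upper bound on what a distance matrix could contain: it does not model how cluster.split divides the data or how a cutoff discards distances, and `pairwise_count` is a hypothetical helper, not part of mothur.

```python
def pairwise_count(n_unique: int) -> int:
    """Number of distinct unordered pairs among n_unique sequences: n(n-1)/2."""
    return n_unique * (n_unique - 1) // 2


if __name__ == "__main__":
    # Unique-sequence count reported by summary.seqs after unique.seqs.
    n = 1166568
    print(f"{pairwise_count(n):.2e} possible pairwise distances")

    # Halving the number of unique sequences cuts the pair count to about
    # a quarter, because the count grows with the square of n.
    print(pairwise_count(n // 2) / pairwise_count(n))
```

Because the count grows quadratically, anything that lowers the number of unique sequences has an outsized effect on the amount of work in the distance and clustering steps.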