Hi, all
I just did an illumina 300bp pair-end Miseq run for my soil samples.
Following the Miseq SOP, after the first uniques.seqs conmmand, I got 9518594 uniques sequences of total 9553985 sequences. there are 16 groups. Is this generally the case or I did something wrong. Because the following command like align.seqs, pre.cluster take so long to finish, it took 24 hour to align and took several days to precluster about 9 million sequences.
I am so frustrated by the time, hope I could get some suggestions from here, thank you.
Dear Wendy,
I am glad you raised the question. We have also just started analyzing our data (from Gobi desert), and we have found that after the
initial unique.seqs command there was a slight difference in the number of sequence. Where did you sample?
Also looking forward to getting some answers on this matter, we are very eager to continue with our work.
Kind regards,
itu
What MiSeq sequencing approach are you using? What region are you sequencing? We have found that if your reads do not fully overlap (e.g. V35) you will get what you are seeing, a ton of uniques. This is because there will be a ton of sequencing errors that aren’t denoised like when the reads do fully overlap (e.g. V4). Can you provide more details?
Hi,
Thanks for the reply. My samples are from one field,different plots.
I got the fastq files from the Miseq, which were two files for each sample with the barcode demultiplexed.
I sequence the V5+V6 region. the following is the summary of my analyses. I am pretty sure there are something wrong, because it should not take several days to finish pre.cluster on 32 processors
mothur > make.contigs(file=second.file,gapextend=0,trimoverlap=T,processors=32)
mothur > summary.seqs(fasta=current)
Using second.trim.contigs.fasta as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1 1 0 1 1
2.5%-tile: 1 94 94 0 4 273984
25%-tile: 1 288 288 0 5 2739831
Median: 1 290 290 0 5 5479662
75%-tile: 1 292 292 0 5 8219493
97.5%-tile: 1 296 296 2 6 10685340
Maximum: 1 303 303 125 103 10959323
Mean: 1 283.073 283.073 0.264236 5.06728
of Seqs: 10959323
I did the screen, unique,align,this is the summary after align.
mothur > summary.seqs(fasta=current,name=second.trim.contigs.good.names)
Using second.trim.contigs.good.unique.align as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 580 12131 285 0 4 238850
25%-tile: 581 12131 289 0 5 2388497
Median: 583 12333 290 0 5 4776993
75%-tile: 585 12333 292 0 5 7165489
97.5%-tile: 585 12334 296 0 6 9315136
Maximum: 21116 21116 301 0 8 9553985
Mean: 676.445 12248.3 288.668 0 5.10696
of unique seqs: 9518594
total # of seqs: 9553985
after the filter ,unique, i still have 9050997 unique sequences. it looks forever to finish the pre.cluster.
I am very appreciate someone could help to find out what is the problem.
Can you send the exact commands you’re running for the whole thing? Also, why are you changing the gapextend parameter? I don’t know that it will make a difference, but could you try again without changing it from the default?
Also, how many samples do you have?
Hi,
Thanks so much. I have only 16 samples,and they are all from one filed soil.
At first, I used those commands:
mothur > make.contigs(file=second.file, processors=32)
mothur > summary.seqs(fasta=current)
Using second.trim.contigs.fasta as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 35 35 0 2 1
2.5%-tile: 1 285 285 0 4 274515
25%-tile: 1 299 299 0 5 2745142
Median: 1 301 301 0 5 5490284
75%-tile: 1 301 301 0 5 8235426
97.5%-tile: 1 313 313 3 7 10706053
Maximum: 1 603 602 297 301 10980567
Mean: 1 296.78 296.779 1.05302 5.67452
of Seqs: 10980567
mothur > screen.seqs(fasta=current,name=current,group=current,maxambig=0,minlength=100,maxlength=330,maxhomop=8)
mothur > summary.seqs(fasta=current,name=current)
Using second.trim.contigs.good.fasta as input file for the fasta parameter.
[WARNING]: no file was saved for name parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 100 100 0 3 1
2.5%-tile: 1 289 289 0 4 238670
25%-tile: 1 299 299 0 5 2386691
Median: 1 300 300 0 5 4773382
75%-tile: 1 301 301 0 5 7160072
97.5%-tile: 1 311 311 0 6 9308093
Maximum: 1 330 330 0 8 9546762
Mean: 1 299.667 299.667 0 5.12169
of Seqs: 9546762
mothur > unique.seqs(fasta=current,name=current)
mothur > summary.seqs(fasta=current,name=current)
Using second.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using second.trim.contigs.good.names as input file for the name parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 100 100 0 3 1
2.5%-tile: 1 289 289 0 4 238670
25%-tile: 1 299 299 0 5 2386691
Median: 1 300 300 0 5 4773382
75%-tile: 1 301 301 0 5 7160072
97.5%-tile: 1 311 311 0 6 9308093
Maximum: 1 330 330 0 8 9546762
Mean: 1 299.667 299.667 0 5.12169
of unique seqs: 9546593
total # of seqs: 9546762
:roll: and I just realized that pretty much they are all unique sequences, so I started over from the make.contigs, use trimoverlap=T, I changed the gapextend=0 because there are one paper suggested, my whole command as follows:
mothur > make.contigs(file=second.file,gapextend=0,trimoverlap=T,processors=32)
Group count:
BNT1-1 650723
BNT1-2 527043
BNT1-3 450401
BNT1-4 597384
BNT5-1 699643
BNT5-2 728870
BNT5-3 677614
BNT5-4 1047991
MC103 1053346
MC105 1087881
RNT1-1 904319
RNT1-4 630049
RNT5-1 871943
RNT5-4 1032116
Total of all groups is 10959323
mothur > summary.seqs(fasta=current)
Using second.trim.contigs.fasta as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1 1 0 1 1
2.5%-tile: 1 94 94 0 4 273984
25%-tile: 1 288 288 0 5 2739831
Median: 1 290 290 0 5 5479662
75%-tile: 1 292 292 0 5 8219493
97.5%-tile: 1 296 296 2 6 10685340
Maximum: 1 303 303 125 103 10959323
Mean: 1 283.073 283.073 0.264236 5.06728
of Seqs: 10959323
mothur > screen.seqs(fasta=current,group=current,maxambig=0,minlength=100,maxlength=320,maxhomop=8)
mothur > summary.seqs(fasta=current)
Using second.trim.contigs.good.fasta as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 100 100 0 3 1
2.5%-tile: 1 285 285 0 4 238850
25%-tile: 1 289 289 0 5 2388497
Median: 1 290 290 0 5 4776993
75%-tile: 1 292 292 0 5 7165489
97.5%-tile: 1 296 296 0 6 9315136
Maximum: 1 301 301 0 8 9553985
Mean: 1 289.547 289.547 0 5.11397
of Seqs: 9553985
mothur > unique.seqs(fasta=current)
mothur > summary.seqs(fasta=current,name=current)
Using second.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using second.trim.contigs.good.names as input file for the name parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 100 100 0 3 1
2.5%-tile: 1 285 285 0 4 238850
25%-tile: 1 289 289 0 5 2388497
Median: 1 290 290 0 5 4776993
75%-tile: 1 292 292 0 5 7165489
97.5%-tile: 1 296 296 0 6 9315136
Maximum: 1 301 301 0 8 9553985
Mean: 1 289.547 289.547 0 5.11397
of unique seqs: 9518594
total # of seqs: 9553985
mothur > pcr.seqs(fasta=silva.bacteria.fasta,start=22000,end=43120,keepdots=F)
mothur > summary.seqs(fasta=current)
Using silva.bacteria.pcr.fasta as input file for the fasta parameter.
Using 1 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 45 21061 729 0 4 1
2.5%-tile: 46 21116 753 0 4 374
25%-tile: 46 21116 769 0 5 3740
Median: 46 21116 772 0 5 7479
75%-tile: 46 21116 774 0 5 11218
97.5%-tile: 46 21116 783 2 7 14583
Maximum: 47 21116 1003 5 9 14956
Mean: 46.0003 21116 770.643 0.175314 5.20988
of Seqs: 14956
mothur > align.seqs(fasta=second.trim.contigs.good.unique.fasta,reference=silva.bacteria.pcr.fasta)
It took 63303 secs to align 9518594 sequences.
mothur > summary.seqs(fasta=current,name=second.trim.contigs.good.names)
Using second.trim.contigs.good.unique.align as input file for the fasta parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 580 12131 285 0 4 238850
25%-tile: 581 12131 289 0 5 2388497
Median: 583 12333 290 0 5 4776993
75%-tile: 585 12333 292 0 5 7165489
97.5%-tile: 585 12334 296 0 6 9315136
Maximum: 21116 21116 301 0 8 9553985
Mean: 676.445 12248.3 288.668 0 5.10696
of unique seqs: 9518594
total # of seqs: 9553985
mothur > screen.seqs(fasta=current,group=second.contigs.good.groups,start=585,end=12131,processors=32)
mothur > summary.seqs(fasta=current,name=current)
Using second.trim.contigs.good.unique.good.align as input file for the fasta parameter.
Using second.trim.contigs.good.good.names as input file for the name parameter.
Using 32 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 545 12131 271 0 3 1
2.5%-tile: 580 12131 285 0 4 231258
25%-tile: 581 12131 289 0 5 2312575
Median: 583 12333 290 0 5 4625150
75%-tile: 585 12333 292 0 5 6937725
97.5%-tile: 585 12334 296 0 6 9019042
Maximum: 585 12479 301 0 8 9250299
Mean: 582.466 12234.6 290.45 0 5.1249
of unique seqs: 9234296
total # of seqs: 9250299
mothur > filter.seqs(fasta=current,vertical=T,trump=.)
othur > unique.seqs(fasta=current,name=current)
9234296 9050997
Output File Names:
second.trim.contigs.good.unique.good.filter.names
second.trim.contigs.good.unique.good.filter.unique.fasta
mothur > pre.cluster(fasta=current,name=current,group=current,diffs=2)
:geek: It took three days now, but still have not finished the pre.cluster
Sorry, but I’m not sure what to tell you except that I think you must have a high error rate in your reads for some reason. Was any PhiX used in your run?
I would get rid of the gapextend=0; however, I don’t know that will make any difference (who suggested that anyway?)
I’m pretty sure that you’ll never get this through cluster.split because the distance matrix will be too huge. The alternative is to just classify everything to phylotypes and proceed from there.
Sorry!
Pat
Hi, Pat
Thank you very much.
I tried to trim the primer sequences after make.contigs, and I found that there were not tons of unique sequences left, and the length looks all right, so I guess there must be high error rate of my primer sequences. I want to try again and see whether it is make sense this way.