mothur

PCR.seqs in MiSeq SOP

Good day

I am going over my first 16S dataset (I normally work with fungi) and am going over the Miseq SOP. I have a question about the alignment and pcr.seqs step. For the pcr.seqs command, I used the same start and end values as in the SOP, because I used the same reference file. I did however ad my primers used as an oligos file. Here is the code:

mothur > pcr.seqs(fasta=silva.bacteria.fasta, oligos=16S.oligos, keepdots=F, processors=8)

#I downloaded the silva file from mothur. It's just a general reference file for bacteria. The oligos file I made
with just the forward and reverse primers like this:

forward CCTACGGGNGGCWGCAG
reverse GACTACHVGGGTATCTAATCC



It took 36 secs to screen 14956 sequences.

Output File Names: 
silva.bacteria.pcr.fasta
silva.bacteria.bad.accnos
silva.bacteria.scrap.pcr.fasta

mothur > rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta)

summary.seqs(fasta=silva.v4.fasta)

mothur > summary.seqs(fasta=silva.v4.fasta)

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	17013	382	0	3	1
2.5%-tile:	2	17014	402	0	4	339
25%-tile:	2	17014	405	0	4	3390
Median: 	2	17014	425	0	5	6779
75%-tile:	2	17014	427	0	5	10168
97.5%-tile:	2	17014	428	1	6	13219
Maximum:	2	17063	469	5	9	13557
Mean:	1	17014	417	0	4
# of Seqs:	13557

mothur > align.seqs(fasta=stability.paired.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)


It took 4159 secs to align 3144255 sequences.

[WARNING]: 712 of your sequences generated alignments that eliminated too many bases, a list is provided in stability.paired.trim.contigs.good.unique.flip.accnos.
[NOTE]: 305 of your sequences were reversed to produce a better alignment.

It took 4160 seconds to align 3144255 sequences.

Output File Names: 
stability.paired.trim.contigs.good.unique.align
stability.paired.trim.contigs.good.unique.align.report
stability.paired.trim.contigs.good.unique.flip.accnos

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align, count=stability.paired.trim.contigs.good.count_table)

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align, count=stability.paired.trim.contigs.good.count_able)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs

Minimum: 0 0 0 0 1 1

2.5%-tile: 2 17014 402 0 4 138038

25%-tile: 2 17014 403 0 4 1380373

Median: 2 17014 411 0 5 2760746

75%-tile: 2 17014 427 0 5 4141118

97.5%-tile: 2 17014 428 0 6 5383453

Maximum: 17063 17063 454 0 118 5521490

Mean: 11 17012 414 0 4

# of unique seqs: 3144255

total # of seqs: 5521490

It took 4969 secs to summarize 5521490 sequences.

Output File Names:

stability.paired.trim.contigs.good.unique.summary

Firstly, that warning message after align.seqs seems off and secondly, when looking at the summary.seqs for the output of align.seqs, the start and end of the sequences do not correspond with the set parameters in pcr.seqs. Could this be because I added the oligos file? Should I take out the oligos file in pcr.seqs and then remove the primers separately after I have done alignment?

Sorry if this is a basic question, but I haven’t done an alignment yet, so I would therefore kindly appreciate feedback regarding this matter.

Best
Nicolas

Hi Nicolas,

The warning message from align.seqs indicates that you have reads that were in the opposite direction from what was expected and/or you have data of poor quality. This would likely fit with the long (>400 nt) contigs compared to the V4 region of the 16S rRNA gene. I’m not sure what you mean by “the sequences do not correspond with the set parameters in pcr.seqs”. You didn’t give it a start or end position, only the primer sequences.

Pat

Dear Pat
Thanks for getting back to me on this. So what I did in the meantime is: I trimmed off the primers before I ran align.seqs, with the .oligos option and then I just followed the rest of the SOP as you’ve outlined. Is that fine?
So with the pcr.seqs command, I first tried it without the oligos file and just with the start and end options, and when I ran summary.seqs after align.seqs, it didn’t match up with the same start and end functions as the command. And again, when I did the primer trimming first prior to pcr.seqs, they didn’t match up with the set parameter. Here, I’ll include my code:

(this is after the screen.seqs step)

mothur > trim.seqs(fasta=stability.paired.trim.contigs.good.fasta, oligos=16S.oligos, pdiffs=0, bdiffs=0, checkorient=T)

Output File Names: 
stability.paired.trim.contigs.good.trim.fasta
stability.paired.trim.contigs.good.scrap.fasta

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.trim.fasta)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	263	263	0	3	1
2.5%-tile:	1	402	402	0	4	126003
25%-tile:	1	403	403	0	4	1260026
Median: 	1	411	411	0	5	2520052
75%-tile:	1	427	427	0	5	3780078
97.5%-tile:	1	428	428	0	6	4914101
Maximum:	1	462	462	0	81	5040103
Mean:	1	414	414	0	4
# of Seqs:	5040103

It took 714 secs to summarize 5040103 sequences.

Output File Names:
stability.paired.trim.contigs.good.trim.summary

Looking at it on geneious, not everything merged properly, but the trim.seqs worked for sure.

mothur > unique.seqs(fasta=stability.paired.trim.contigs.good.trim.fasta)

Output File Names: 
stability.paired.trim.contigs.good.trim.names
stability.paired.trim.contigs.good.trim.unique.fasta

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.fasta, name=stability.paired.trim.contigs.good.trim.names)

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	263	263	0	3	1
2.5%-tile:	1	402	402	0	4	126003
25%-tile:	1	403	403	0	4	1260026
Median: 	1	411	411	0	5	2520052
75%-tile:	1	427	427	0	5	3780078
97.5%-tile:	1	428	428	0	6	4914101
Maximum:	1	462	462	0	81	5040103
Mean:	1	414	414	0	4
# of unique seqs:	1848492
total # of seqs:	5040103

It took 136 secs to summarize 5040103 sequences.

Output File Names:
stability.paired.trim.contigs.good.trim.unique.summary

count.seqs(name=stability.paired.trim.contigs.good.trim.names, group=stability.paired.contigs.groups)

mothur > count.seqs(name=stability.paired.trim.contigs.good.trim.names, group=stability.paired.contigs.groups)

It took 131 secs to create a table for 5040103 sequences.

Total number of sequences: 5040103

Output File Names: 
stability.paired.trim.contigs.good.trim.count_table

mothur > pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F, processors=8)

It took 20 secs to screen 14956 sequences.

Output File Names: 
silva.bacteria.pcr.fasta

rename.file(input=silva.bacteria.pcr.fasta, new=silva.v4.fasta)

mothur > summary.seqs(fasta=silva.v4.fasta)

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	13424	270	0	3	1
2.5%-tile:	1	13425	292	0	4	374
25%-tile:	1	13425	293	0	4	3740
Median: 	1	13425	293	0	4	7479
75%-tile:	1	13425	293	0	5	11218
97.5%-tile:	1	13425	294	1	6	14583
Maximum:	3	13425	351	5	9	14956
Mean:	1	13424	292	0	4
# of Seqs:	14956

It took 8 secs to summarize 14956 sequences.

Output File Names:
silva.v4.summary

mothur > align.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.fasta, reference=silva.v4.fasta)

[WARNING]: 32 of your sequences generated alignments that eliminated too many bases, a list is provided in stability.paired.trim.contigs.good.trim.unique.flip.accnos.
[NOTE]: 0 of your sequences were reversed to produce a better alignment.

It took 1697 seconds to align 1848492 sequences.

Output File Names: 
stability.paired.trim.contigs.good.trim.unique.align
stability.paired.trim.contigs.good.trim.unique.align.report
stability.paired.trim.contigs.good.trim.unique.flip.accnos

summary.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.align, count=stability.paired.trim.contigs.good.trim.count_table)
 
 
mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.align, count=stability.paired.trim.contigs.good.trim.count_table)

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	4393	111	0	3	1
2.5%-tile:	1	11546	269	0	4	126003
25%-tile:	1	11546	270	0	4	1260026
Median: 	1	11546	270	0	4	2520052
75%-tile:	1	11546	270	0	5	3780078
97.5%-tile:	1	11546	271	0	6	4914101
Maximum:	1241	13425	297	0	17	5040103
Mean:	1	11545	270	0	4
# of unique seqs:	1848492
total # of seqs:	5040103

It took 2123 secs to summarize 5040103 sequences.

Output File Names:
stability.paired.trim.contigs.good.trim.unique.summary

screen.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.align, count=stability.paired.trim.contigs.good.trim.count_table, summary=stability.paired.trim.contigs.good.trim.unique.summary, start=1, end=11546, maxhomop=8)

I carried on with that all the way to the cluster step, but now at the cluster step, I am having errors. My count table and my .dist file does not seem to match up. But anyways, I would please like to hear your opinion on my approach to the PCR.seqs step before I get to that again.

Thanks in advance!
Best
Nicolas

Hi Nicolas,

It looks like you are using the V4 coordinates, but I don’t think you have V4 data. I suspect you really have V4-V5 data judging by their original lengths of >400 nt. I’m not totally clear what you mean by …

it didn’t match up with the same start and end functions as the command. And again, when I did the primer trimming first prior to pcr.seqs, they didn’t match up with the set parameter.

But I suspect they didn’t correspond with the coordinates on the wiki because the wiki is using the V4 region rather than the V4-V5, which is what you need.

Can you provide the steps between running screen.seqs and where you are running into problems?

Thanks,
Pat

Hey Pat
Thanks for getting back to me again. What I meant with that statement you copied is that after the alignment step, the start and end sequences don’t start where I set the parameter for them to start. Your speculation on the V4-V5 part makes sense. So does that mean I need to set different start and end values based on V4-V5, rather than the start and end values corresponding with V4, like in the Wiki?

I am copying my workflow from after screen.seqs here:

mothur > filter.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.good.align, vertical=T, trump=.)

It took 1044 secs to filter 1845905 sequences.

Length of filtered alignment: 695
Number of columns removed: 12730
Length of the original alignment: 13425
Number of sequences used to construct filter: 1845905

Output File Names:
stability.filter
stability.paired.trim.contigs.good.trim.unique.good.filter.fasta

unique.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.good.filter.fasta, count=stability.paired.trim.contigs.good.trim.count_table)

Output File Names:
stability.paired.trim.contigs.good.trim.unique.good.filter.count_table
stability.paired.trim.contigs.good.trim.unique.good.filter.unique.fasta

pre.cluster(fasta=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.fasta, count=stability.paired.trim.contigs.good.trim.unique.good.filter.count_table, diffs=2)

stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.fasta
stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.count_table

chimera.vsearch(fasta=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.fasta, count=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.count_table, dereplicate=t)

Reading file stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.temp1.temp 100%
1953750 nt in 7235 seqs, min 248, max 280, avg 270
Masking 27%
It took 23 secs to check 7618 sequences from group 75G3.
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Detecting chimeras 100%
Found 2285 (31.6%) chimeras, 4858 (67.1%) non-chimeras,
and 92 (1.3%) borderline sequences in 7235 unique sequences.
Taking abundance information into account, this corresponds to
2735 (4.4%) chimeras, 59890 (95.4%) non-chimeras,
and 132 (0.2%) borderline sequences in 62757 total sequences.

It took 23 secs to check 7235 sequences from group 53E5.
It took 546 secs to check 556205 sequences.

Output File Names:
stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table
stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.chimeras
stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.accnos

remove.seqs(fasta=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.fasta, accnos=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.accnos)

Removed 130869 sequences from your fasta file.

Output File Names:
stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pick.fasta

Try this now:

mothur > cluster(column=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pick.dist, count=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table)

That didn’t work, try this

mothur > cluster.split(fasta=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pick.fasta, count=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, taxonomy=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pds.wang.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)

Both commands cluster and cluster.split gave me errors stating that the list values and the fasta values did not correspond to the values in the count table. It gave me one value that was not in the count.table.
More specifically, this is what happens when I try to run the cluster command:
"mothur > cluster(column=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pick.dist, count=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table)

Using 4 processors.

You did not set a cutoff, using 0.03.

Clustering stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pick.dist

iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score

0.03
AAError: Sequence ‘M00313_220_000000000-CRDLT_1_1118_4699_6126’ was not found in the name or count file, please correct"

And when I try to run the split.cluster command:
" mothur > cluster.split(fasta=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pick.fasta, count=stability.paired.trim.contigs.good.trim.unique.good.filter.uniqe.precluster.denovo.vsearch.pick.count_table, taxonomy=stability.paired.trim.contigs.good.trim.unique.good.filter.unique.precluster.pds.wang.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)

Using 4 processors.

Splitting the file…

ERROR: M00313_220_000000000-CRDLT_1_1102_11126_19733 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate an error.

ERROR: M00313_220_000000000-CRDLT_1_1102_11290_9108 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate an error.

ERROR: M00313_220_000000000-CRDLT_1_1102_13546_14638 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate an error.

ERROR: M00313_220_000000000-CRDLT_1_1102_14327_13893 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate an error.

ERROR: M00313_220_000000000-CRDLT_1_1102_16551_3122 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate an erro"

This error message just repeats for many lines.

Thanks in advance.
Best
Nicolas

Hi Nicolas,

You can find instructions on customizing the reference to your region of interest by going to Customize your reference alignment for your favorite region and following the instructions there.

Can you re-run classify.seqs and dist.seqs on the output of remove.seqs? I wonder if something weird happened and perhaps you accidentally skipped it when you were running these commands before running cluster and cluster.split.

Pat

Dear Pat
Again, thanks for all the tips! The link you sent me on customizing my reference alignment for your favorite region is useful, but I’m not 100% if I followed it correctly. Everything seemed to work fine, until I reached the pre.cluster step with my new alignments. At first I tried running pre.cluster with the count.table, but I kept on getting errors. Then I tried it with the group.file, and I got similar errors again.

Sorry if this takes up too much of your time, I’m going to try and set up a meeting with a bioinformatician to troubleshoot this with me. But again, if you have the time, I would kindly appreciate your input.

This is what I did following the link that you sent me.

mothur > align.seqs(fasta=stability.paired.trim.contigs.good.unique.fasta, reference=silva.nr_v138.align)

[WARNING]: 893 of your sequences generated alignments that eliminated too many bases, a list is provided in stability.paired.trim.contigs.good.unique.flip.accnos.
[NOTE]: 485 of your sequences were reversed to produce a better alignment.

It took 23506 seconds to align 3144255 sequences.

Output File Names:
stability.paired.trim.contigs.good.unique.align
stability.paired.trim.contigs.good.unique.align.report
stability.paired.trim.contigs.good.unique.flip.accnos

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align)

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 0 0 0 0 1 1
2.5%-tile: 6388 25316 440 0 4 78607
25%-tile: 6388 25316 441 0 5 786064
Median: 6388 25316 455 0 5 1572128
75%-tile: 6388 25316 465 0 6 2358192
97.5%-tile: 6388 25316 466 0 7 3065649
Maximum: 43116 43116 500 0 136 3144255
Mean: 6406 25326 452 0 5

of Seqs: 3144255

It took 16496 secs to summarize 3144255 sequences.

Output File Names:
stability.paired.trim.contigs.good.unique.summary

mothur > pcr.seqs(fasta=silva.nr_v138.align, start=6388, end=25316, keepdots=F, processors=8)

It took 235 secs to screen 146796 sequences.

Output File Names:
silva.nr_v138.pcr.align

mothur > rename.file(input=silva.nr_v138.pcr.align, new=silva.v4.fasta)

mothur > summary.seqs(fasta=silva.v4.fasta)

Using 8 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 15382 256 0 3 1
2.5%-tile: 1 18928 436 0 4 3670
25%-tile: 1 18928 441 0 4 36700
Median: 1 18928 461 0 5 73399
75%-tile: 1 18928 464 0 6 110098
97.5%-tile: 1 18928 606 1 7 143127
Maximum: 3865 18928 1666 5 16 146796
Mean: 1 18927 468 0 5

of Seqs: 146796

It took 159 secs to summarize 146796 sequences.

Output File Names:
silva.v4.summary

mothur > align.seqs(fasta=stability.paired.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)

It took 16423 secs to align 3144255 sequences.

[WARNING]: 633 of your sequences generated alignments that eliminated too many bases, a list is provided in stability.paired.trim.contigs.good.unique.flip.accnos.
[NOTE]: 385 of your sequences were reversed to produce a better alignment.

It took 16430 seconds to align 3144255 sequences.

Output File Names:
stability.paired.trim.contigs.good.unique.align
stability.paired.trim.contigs.good.unique.align.report
stability.paired.trim.contigs.good.unique.flip.accnos

summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align)

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align)

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 0 0 0 0 1 1
2.5%-tile: 1 18928 439 0 4 78607
25%-tile: 1 18928 440 0 5 786064
Median: 1 18928 453 0 5 1572128
75%-tile: 1 18928 464 0 6 2358192
97.5%-tile: 1 18928 465 0 7 3065649
Maximum: 18928 18928 498 0 135 3144255
Mean: 17 18925 451 0 5

of Seqs: 3144255

It took 6146 secs to summarize 3144255 sequences.

Output File Names:
stability.paired.trim.contigs.good.unique.summary

In the MIseqSOP they use the count.table as well. I’m just a little concerned that the start and end sequences in the summary.table does not reflect
what I set it was in the pcr.seqs command

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align, count=stability.paired.trim.contigs.good.count_table)

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 0 0 0 0 1 1
2.5%-tile: 1 18928 439 0 4 138038
25%-tile: 1 18928 440 0 5 1380373
Median: 1 18928 448 0 5 2760746
75%-tile: 1 18928 464 0 6 4141118
97.5%-tile: 1 18928 465 0 6 5383453
Maximum: 18928 18928 498 0 135 5521490
Mean: 10 18926 451 0 5

of unique seqs: 3144255

total # of seqs: 5521490

It took 6984 secs to summarize 5521490 sequences.

Output File Names:
stability.paired.trim.contigs.good.unique.summary

mothur > screen.seqs(fasta=stability.paired.trim.contigs.good.unique.align, count=stability.paired.trim.contigs.good.count_table, maxhomop=8)

It took 6472 secs to screen 3144255 sequences, removed 5688.

/******************************************/
Running command: remove.seqs(accnos=stability.paired.trim.contigs.good.unique.bad.accnos.temp, count=stability.paired.trim.contigs.good.count_table)
Removed 5791 sequences from your count file.

Output File Names:
stability.paired.trim.contigs.good.pick.count_table

/******************************************/

Output File Names:
stability.paired.trim.contigs.good.unique.good.align
stability.paired.trim.contigs.good.unique.bad.accnos
stability.paired.trim.contigs.good.good.count_table

mothur > filter.seqs(fasta=stability.paired.trim.contigs.good.unique.good.align, vertical=T, trump=.)

Length of filtered alignment: 0
Number of columns removed: 18928
Length of the original alignment: 18928
Number of sequences used to construct filter: 3138567

Output File Names:
stability.filter
stability.paired.trim.contigs.good.unique.good.filter.fasta

mothur > unique.seqs(fasta=stability.paired.trim.contigs.good.unique.good.filter.fasta, count=stability.paired.trim.contigs.good.good.count_table)

Output File Names:
stability.paired.trim.contigs.good.unique.good.filter.count_table
stability.paired.trim.contigs.good.unique.good.filter.unique.fasta

mothur > pre.cluster(fasta=stability.paired.trim.contigs.good.unique.good.filter.unique.fasta, group=stability.paired.contigs.groups, diffs=2)

And then here is where I keep on getting error messages.