Processors used

Hi everyone,
I’m using mothur to process Illumina sequence data. The dataset is huge, so I’m running it on Ohio State University’s supercomputer with a batch file, which seems to be working smoothly. My problem is that jobs are limited to 7 days and mine is not finishing in time. When I checked the job report I noticed that it was using only 1 of the 12 available processors, so I thought that adding the parameter
processors=12
to every command would make it run much faster. But I have checked the job status and it is still using only one processor. Am I doing something wrong? Any ideas?

Thanks!

Juan Gonzalez

Here is my batch file:
pcr.seqs(fasta=/nfs/16/osu8334/Nramp_full_mothur/silva.bacteria.fasta, start=1044, end=13127, keepdots=F, processors=12)
trim.seqs(fasta=/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.fasta, oligos=/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina.oligos, processors=12)
screen.seqs(fasta=current, maxambig=0, maxlength=520, minlength=400, processors=12)
unique.seqs()
count.seqs(name=current, group=current, processors=12)
align.seqs(fasta=current, reference=/nfs/16/osu8334/Nramp_full_mothur/silva.bacteria.pcr.fasta, processors=12)
pre.cluster(fasta=current, count=current, diffs=2, processors=12)
unique.seqs(fasta=current, count=current, processors=12)
pre.cluster(fasta=current, count=current, diffs=2, processors=12)
classify.seqs(fasta=current, count=current, reference=/nfs/16/osu8334/Nramp_full_mothur/trainset9_032012.pds.fasta, taxonomy=/nfs/16/osu8334/Nramp_full_mothur/trainset9_032012.pds.tax, cutoff=80, processors=12)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=12)
make.shared(list=current, count=current, label=0.03, processors=12)
classify.otu(list=current, count=current, taxonomy=current, label=0.03, processors=12)
phylotype(taxonomy=current, processors=12)
make.shared(list=current, count=current, label=1, processors=12)
classify.otu(list=current, count=current, taxonomy=current, label=1, processors=12)

Can you provide details on the region you are sequencing, library generation protocol, and the number of reads you are getting out at each step?

It doesn’t look like you’re making contigs? And you aren’t running filter.seqs?

Pat

Hi Pat,

The sequences are of the V1 region, amplified with the 27F and 519R primers. My vendor provided a fasta file in which the contigs had already been assembled, so I didn’t run make.contigs.

Should I run filter.seqs?

According to my provider the DNA library was prepared as follows:

The 16S rRNA gene V1 variable region PCR primers 27F/519R with barcode on the forward primer were used in a 30 cycle PCR using the HotStarTaq Plus Master Mix Kit (Qiagen, USA) under the following conditions: 94°C for 3 minutes, followed by 28 cycles of 94°C for 30 seconds, 53°C for 40 seconds and 72°C for 1 minute, after which a final elongation step at 72°C for 5 minutes was performed. After amplification, PCR products are checked in 2% agarose gel to determine the success of amplification and the relative intensity of bands. Multiple samples are pooled together (e.g., 100 samples) in equal proportions based on their molecular weight and DNA concentrations. Pooled samples are purified using calibrated Ampure XP beads. Then the pooled and purified PCR product is used to prepare DNA library by following Illumina TruSeq DNA library preparation protocol. Sequencing was performed at MR DNA (www.mrdnalab.com, Shallowater, TX, USA) on a MiSeq following the manufacturer’s guidelines. Sequence data were processed using a proprietary analysis pipeline (MR DNA, Shallowater, TX, USA). In summary, sequences were depleted of barcodes then sequences <150bp removed, sequences with ambiguous base calls removed. Sequences were denoised, OTUs generated and chimeras removed. Operational taxonomic units (OTUs) were defined by clustering at 3% divergence (97% similarity). Final OTUs were taxonomically classified using BLASTn against a curated GreenGenes database (DeSantis et al 2006).

Here is the log of the partial job (I removed the numbers to stay within the character limit for posts here):

mothur > pcr.seqs(fasta=/nfs/16/osu8334/Nramp_full_mothur/silva.bacteria.fasta, start=1044, end=13127, keepdots=F)


Output File Names: /nfs/16/osu8334/Nramp_full_mothur/silva.bacteria.pcr.fasta
It took 26 secs to screen 14956 sequences.

mothur > trim.seqs(fasta=/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.fasta, oligos=/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina.oligos)


Group count:
N107 211158
N115 169089
N118 276253
N119 209827
N12 263496
N3 242268
N37 164808
N54 184111
N56 183022
N97 229924
P13 219475
P18 195305
P23 209718
P36 227669
P46 228622
P50 206962
P61 208723
P75 229255
P79 198388
P9 216692
Total of all groups is 4274765

Output File Names:
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.fasta
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.scrap.fasta
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.groups

[WARNING]: your sequence names contained ‘:’. I changed them to ‘_’ to avoid problems in your downstream analysis.

mothur > screen.seqs(fasta=current, maxambig=0, maxlength=520, minlength=400)
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.fasta as input file for the fasta parameter.



Output File Names:
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.fasta
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.bad.accnos
It took 9655 secs to screen 4274765 sequences.

mothur > unique.seqs()
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.fasta as input file for the fasta parameter.

Output File Names:
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.names
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.fasta


mothur > count.seqs(name=current, group=current)
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.groups as input file for the group parameter.
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.names as input file for the name parameter.

Using 1 processors.
[ERROR]: processes reported processing 4257321 sequences, but group file indicates you have 4274765 sequences. Could you have a file mismatch?
It took 58 secs to create a table for 4257321 sequences.


Total number of sequences: 4257321

Output File Names:
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.count_table


mothur > align.seqs(fasta=current, reference=/nfs/16/osu8334/Nramp_full_mothur/silva.bacteria.pcr.fasta)
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.fasta as input file for the fasta parameter.

Using 1 processors.

Reading in the /nfs/16/osu8334/Nramp_full_mothur/silva.bacteria.pcr.fasta template sequences… DONE.
It took 35 to read 14956 sequences.
Aligning sequences from /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.fasta …

Total number of sequences before pre.cluster was 134769.
pre.cluster removed 42743 sequences.

It took 14110 secs to cluster 134769 sequences.

Processing group N56:

Total number of sequences before pre.cluster was 132889.
pre.cluster removed 42127 sequences.

It took 13728 secs to cluster 132889 sequences.

Processing group N97:

Total number of sequences before pre.cluster was 157012.
pre.cluster removed 53260 sequences.

It took 19281 secs to cluster 157012 sequences.

Processing group P13:

Total number of sequences before pre.cluster was 152434.
pre.cluster removed 51350 sequences.

It took 16888 secs to cluster 152434 sequences.

Processing group P18:

Total number of sequences before pre.cluster was 132299.
pre.cluster removed 42853 sequences.

It took 13041 secs to cluster 132299 sequences.

Processing group P23:

Total number of sequences before pre.cluster was 146093.
pre.cluster removed 49103 sequences.

It took 15304 secs to cluster 146093 sequences.

Processing group P36:

Total number of sequences before pre.cluster was 155034.
pre.cluster removed 52782 sequences.

It took 18472 secs to cluster 155034 sequences.

Processing group P46:

Total number of sequences before pre.cluster was 160452.
pre.cluster removed 52945 sequences.

It took 19876 secs to cluster 160452 sequences.

Processing group P50:

Total number of sequences before pre.cluster was 140916.
pre.cluster removed 47194 sequences.

It took 14511 secs to cluster 140916 sequences.

Processing group P61:

Total number of sequences before pre.cluster was 137797.
pre.cluster removed 45009 sequences.

It took 14363 secs to cluster 137797 sequences.

Processing group P75:

Total number of sequences before pre.cluster was 158641.
pre.cluster removed 53460 sequences.

It took 19205 secs to cluster 158641 sequences.

Processing group P79:

Total number of sequences before pre.cluster was 136129.
pre.cluster removed 46539 sequences.

It took 13794 secs to cluster 136129 sequences.

Processing group P9:

Total number of sequences before pre.cluster was 155773.
pre.cluster removed 49893 sequences.

It took 18359 secs to cluster 155773 sequences.
It took 340013 secs to run pre.cluster.

Output File Names:
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.align
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.count_table
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N107.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N115.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N118.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N119.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N12.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N3.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N37.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N54.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N56.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.N97.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P13.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P18.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P23.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P36.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P46.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P50.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P61.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P75.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P79.map
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.P9.map


mothur > unique.seqs(fasta=current, count=current)
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.count_table as input file for the count parameter.
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.align as input file for the fasta parameter.

Output File Names:
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.unique.count_table
/nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.unique.align

mothur > pre.cluster(fasta=current, count=current, diffs=2)
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.unique.count_table as input file for the count parameter.
Using /nfs/16/osu8334/Nramp_full_mothur/031914BA27Fillumina_full.trim.good.unique.precluster.unique.align as input file for the fasta parameter.

Using 1 processors.

Processing group N107:

Total number of sequences before pre.cluster was 103240.
pre.cluster removed 0 sequences.

It took 14572 secs to cluster 103240 sequences.

Processing group N115:

Total number of sequences before pre.cluster was 85551.
pre.cluster removed 0 sequences.

It took 9641 secs to cluster 85551 sequences.

Processing group N118:

Total number of sequences before pre.cluster was 117571.
pre.cluster removed 0 sequences.

It took 20886 secs to cluster 117571 sequences.

Processing group N119:

Total number of sequences before pre.cluster was 103824.
pre.cluster removed 0 sequences.

It took 14865 secs to cluster 103824 sequences.

Processing group N12:

Total number of sequences before pre.cluster was 112216.
pre.cluster removed 0 sequences.

It took 20972 secs to cluster 112216 sequences.

Processing group N3:

Total number of sequences before pre.cluster was 107956.
pre.cluster removed 0 sequences.

It took 18113 secs to cluster 107956 sequences.

Processing group N37:

Total number of sequences before pre.cluster was 87336.
pre.cluster removed 0 sequences.

It took 11805 secs to cluster 87336 sequences.

Processing group N54:

Total number of sequences before pre.cluster was 92026.
pre.cluster removed 0 sequences.

It took 13490 secs to cluster 92026 sequences.

Processing group N56:

Total number of sequences before pre.cluster was 90762.
pre.cluster removed 0 sequences.

It took 13269 secs to cluster 90762 sequences.

Processing group N97:

Total number of sequences before pre.cluster was 103752.
pre.cluster removed 0 sequences.

It took 18458 secs to cluster 103752 sequences.

Processing group P13:

Total number of sequences before pre.cluster was 101084.
pre.cluster removed 0 sequences.

It took 15600 secs to cluster 101084 sequences.

Processing group P18:

Total number of sequences before pre.cluster was 89446.
pre.cluster removed 0 sequences.

It took 12540 secs to cluster 89446 sequences.

Processing group P23:

Total number of sequences before pre.cluster was 96990.
pre.cluster removed 0 sequences.

It took 14479 secs to cluster 96990 sequences.

Processing group P36:

=>> PBS: job killed: walltime 604815 exceeded limit 604800


Resources requested:
mem=48gb
nodes=1:ppn=12

Resources used:
cput=162:44:48
walltime=168:00:16
mem=46.249 GB
vmem=46.480 GB

Resource units charged (estimate):
201.605 RUs

Estimated RU charges under proposed new accounting policy:
201.605 RUs
See http://osc.edu/memcharging for more information.
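
For anyone wanting to check this on their own logfile: below is a minimal Python sketch that pairs each "mothur > command(...)" line with the "Using N processors." line that follows it, in the format shown in the log above. The function name and the shortened sample text are made up for illustration; commands that print no processors line are reported as None.

```python
import re

# Patterns follow the log lines shown in this thread.
CMD = re.compile(r"^mothur > ([A-Za-z_.]+)\(")
PROC = re.compile(r"^Using (\d+) processors\.")

def processors_per_command(log_text):
    """Return (command, processors) pairs; processors is None if not reported."""
    results = []
    for line in log_text.splitlines():
        line = line.strip()
        cmd = CMD.match(line)
        if cmd:
            results.append([cmd.group(1), None])
            continue
        proc = PROC.match(line)
        if proc and results:
            results[-1][1] = int(proc.group(1))
    return [tuple(r) for r in results]

# Shortened sample in the same shape as the log above (paths abbreviated).
sample_log = """\
mothur > unique.seqs()
Using full.trim.good.fasta as input file for the fasta parameter.
mothur > count.seqs(name=current, group=current)
Using 1 processors.
mothur > align.seqs(fasta=current, reference=silva.bacteria.pcr.fasta)
Using 1 processors.
"""

report = processors_per_command(sample_log)
for command, n in report:
    print(command, n)
```

To run it on a real file, read the logfile into a string and pass it to processors_per_command in place of sample_log.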

Ugh, another MrDNA disaster. I don’t know why they’re using this protocol. It really does no one any good. In the future you should find a new sequence provider. They always seem to find the hardest way to do anything and none of their methods are benchmarked. Avoid them.

As for this dataset… Can you run summary.seqs on the output of align.seqs and post the result? The file name should end in *.align.

Pat

Sure thing, here it is:

mothur > summary.seqs(fasta=/nfs/16/osu8334/Desktop/20140619_mothur/031914BA27Fillumina_full.trim.good.unique.align, count=/nfs/16/osu8334/Desktop/20140619_mothur/031914BA27Fillumina_full.trim.good.count_table, processors=12)

Using 12 processors.

             Start    End      NBases   Ambigs  Polymer  NumSeqs
Minimum:     1        5365     4        0       1        1
2.5%-tile:   2        12083    447      0       4        106434
25%-tile:    2        12083    489      0       5        1064331
Median:      2        12083    491      0       5        2128661
75%-tile:    2        12083    495      0       5        3192991
97.5%-tile:  2        12083    519      0       6        4150888
Maximum:     10851    12083    520      0       96       4257321
Mean:        3.41828  12072.6  491.738  0       5.16951
# of unique seqs: 2585664
total # of seqs: 4257321

Output File Names:
/nfs/16/osu8334/Desktop/20140619_mothur/031914BA27Fillumina_full.trim.good.unique.summary

Try using more processors in pre.cluster and you’ll run multiple samples at the same time. So your cpu time will be the same, but your wall time will be much less.
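
As a rough illustration of the cpu time versus wall time point, here is a small Python sketch using the 13 per-group timings that completed in the second pre.cluster run in the log above. It spreads the groups over 12 workers with a simple longest-first greedy rule. That rule is only an assumption for the sake of the arithmetic; how mothur actually assigns groups to processors is not shown in this thread and may differ.

```python
import heapq

# Seconds per group, copied from the second pre.cluster run in the log above.
group_secs = {
    "N107": 14572, "N115": 9641, "N118": 20886, "N119": 14865,
    "N12": 20972, "N3": 18113, "N37": 11805, "N54": 13490,
    "N56": 13269, "N97": 18458, "P13": 15600, "P18": 12540,
    "P23": 14479,
}

def greedy_wall_time(times, workers):
    """Longest-first greedy assignment; returns the busiest worker's total."""
    loads = [0] * workers
    heapq.heapify(loads)
    for t in sorted(times, reverse=True):
        lightest = heapq.heappop(loads)   # worker with the least work so far
        heapq.heappush(loads, lightest + t)
    return max(loads)

serial_secs = sum(group_secs.values())                  # one processor
parallel_secs = greedy_wall_time(group_secs.values(), 12)

print(f"1 processor:   {serial_secs} secs ({serial_secs / 3600:.1f} h)")
print(f"12 processors: {parallel_secs} secs ({parallel_secs / 3600:.1f} h)")
```

The total work is unchanged, but with one group per processor the wall time is bounded by the slowest processor rather than by the sum of all groups.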

Also, I’m afraid I don’t have high hopes for these data making it through much more of the pipeline, as the large number of uniques is a symptom of the high error rate you are encountering with the sequencing of the V1-V3 region (27F/519R). We have shown (see Kozich et al.) that the reads must fully overlap to get adequate denoising of the data.
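
To put a number on "large number of uniques", here is a quick Python check using only the counts reported in the trim.seqs, count.seqs and summary.seqs output above. The variable names are just for illustration.

```python
# Counts copied from the log and summary.seqs output in this thread.
total_after_trim = 4274765     # "Total of all groups" from trim.seqs
total_after_screen = 4257321   # "total # of seqs" from summary.seqs
unique_seqs = 2585664          # "# of unique seqs" from summary.seqs

removed_by_screen = total_after_trim - total_after_screen
unique_fraction = unique_seqs / total_after_screen

print(f"Removed by screen.seqs: {removed_by_screen}")
print(f"Unique fraction: {unique_fraction:.1%}")
```

So roughly six out of every ten reads are unique sequences. The difference between the two totals also appears to match the mismatch count.seqs reported between the names file and the group file.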

Pat

Thanks Pat, I’ll try that. Is there a sequence provider you could recommend?

You could send your samples to Michigan - we implement the Kozich method. If you email me I can get you the contact information of who to talk to.