Problem with Full Data Set

My name is Brittany Jones, and I am a research assistant at Southern Illinois University in Carbondale, IL. I am working with an environmental microbial data set for our Acid Mine Drainage project; however, I have hit a snag in processing the data set.

make.contigs(file=dnr.txt, processors=20)
summary.seqs(fasta=current)
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=305)
summary.seqs(fasta=current)
get.current()
unique.seqs(fasta=current)
count.seqs(name=current, group=current)
summary.seqs(count=current)
pcr.seqs(fasta=silva.nr_v123.align, start=8000, end=27000, keepdots=F, processors=20)
system(mv silva.nr_v123.pcr.align silva.v4.align)
summary.seqs(fasta=silva.v4.align)
align.seqs(fasta=dnr.trim.contigs.good.unique.fasta, reference=silva.v4.align)
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, summary=current, start=2368, end=17316, maxhomop=8)
[ERROR]: Could not open dnr.trim.contigs.good.unique.align7029.num.temp
(This error was repeated for multiple align#.num.temp files.)
[ERROR]: found 451212 sequences in your fasta file, and 3007015 sequences in your summary file, quitting.
summary.seqs(fasta=current, count=current)
get.current()
filter.seqs(fasta=current, vertical=T, trump=., processors=20)
Creating Filter…
[ERROR]: Sequences are not all the same length, please correct.
(This error was repeated multiple times.)
unique.seqs(fasta=current, count=current)
[ERROR]: Could not open dnr.trim.contigs.good.unique.count_table
summary.seqs(fasta=current, count=current)
Unable to open dnr.trim.contigs.good.unique.count_table. Trying default /share/apps/mothur-1.37.6/dnr.trim.contigs. etc…
Unable to open /share/apps/mothur-1.37.6/dnr.trim.contigs.good.unique.count_table
[WARNING]: This command can take a name file and you did not provide one. The current name file is dnr.trim.contigs.good.names which seems to match dnr.trim.contigs.good.unique.uniqe.align
[ERROR]: Did not complete summary.seqs

The script file continues with more errors, but the problem seems to start in the commands above. When I break the data set down to 4 or 5 samples at a time, it works with no problem. However, if I run all 17 samples together, these are the errors I get. I am using mothur 1.37.6 and SILVA v123, but our collaborator gets the same error using the updated version of mothur and the newer SILVA v128. I am using HPC resources, so computational power and memory should not be issues. We also noticed that the unique.seqs command does not reduce the data set greatly.

Has anyone else come across this problem before? Does anyone have any ideas for how we can get our data to process as a whole set? (We are using the V4 region of the 16S rRNA gene.)

Thanks,
Brittany

I suspect you had errors when running align.seqs with 20 processors. Can you rerun that with ~4 processors and go from there?

Pat
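
One way to check whether the alignment output is incomplete, before rerunning anything, is to compare the number of sequences in the fasta file with the number of rows in the summary file, since that is the mismatch the error above reports. The following is a minimal Python sketch, not part of mothur. The function names are made up for illustration, and it assumes the summary file begins with a header line starting with `seqname`.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count sequences in FASTA text: one '>' header line per sequence."""
    return sum(1 for line in lines if line.startswith(">"))


def count_summary_rows(lines: Iterable[str]) -> int:
    """Count data rows in a summary file.

    Assumption: the header line starts with 'seqname'; blank lines are ignored.
    """
    rows = 0
    for line in lines:
        if not line.strip() or line.startswith("seqname"):
            continue
        rows += 1
    return rows


def files_agree(fasta_path: str, summary_path: str) -> bool:
    """Print both counts and return True when they match."""
    with open(fasta_path) as fasta, open(summary_path) as summary:
        n_fasta = count_fasta_records(fasta)
        n_summary = count_summary_rows(summary)
    print(f"{fasta_path}: {n_fasta} sequences")
    print(f"{summary_path}: {n_summary} rows")
    return n_fasta == n_summary
```

For the files in this thread, the call would look like `files_agree("dnr.trim.contigs.good.unique.align", "dnr.trim.contigs.good.unique.summary")`. If the two counts differ, the alignment file was probably not written completely.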

I do believe I can change the processors to ~4. (Do I just add processors=4 to that command line in my file?)

I submitted our data set again with your suggested change, and it now has a new error. I am wondering whether I need to change every command from align.seqs onward to use ~4 processors.

mothur > align.seqs(fasta=dnr.trim.contigs.good.unique.fasta, reference=silva.v4.align, processors=4)

It took 71 to read 172418 sequences.
Aligning sequences from dnr.trim.contigs.good.unique.fasta …
Some of your sequences generated alignments that eliminated too many bases, a list is provided in dnr.trim.contigs.good.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the reverse compliment as well.
It took 17482 secs to align 5103189 sequences.

Output File Names:
dnr.trim.contigs.good.unique.align
dnr.trim.contigs.good.unique.align.report
dnr.trim.contigs.good.unique.flip.accnos


mothur > summary.seqs(fasta=current, count=current)
Using dnr.trim.contigs.good.count_table as input file for the count parameter.
Using dnr.trim.contigs.good.unique.align as input file for the fasta parameter.

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 2360 17318 13 0 2 133089
25%-tile: 2366 17318 299 0 5 1330883
Median: 2369 17318 301 0 6 2661766
75%-tile: 18996 26501 444 0 26 3992648
97.5%-tile: 0 0 0 0 1 5190442
Maximum: 18996 26501 444 0 26 5323530
Mean: 1742.29 10157.5 169.544 0 2.9191
# of unique seqs: 2999654
total # of seqs: 5323530

Output File Names:
dnr.trim.contigs.good.unique.summary

It took 373 secs to summarize 5323530 sequences.

mothur > screen.seqs(fasta=current, count=current, summary=current, start=2368, end=17316, maxhomop=8)
Using dnr.trim.contigs.good.count_table as input file for the count parameter.
Using dnr.trim.contigs.good.unique.align as input file for the fasta parameter.
Using dnr.trim.contigs.good.unique.summary as input file for the summary parameter.

Using 4 processors.

Using dnr.trim.contigs.good.count_table as input file for the count parameter.
Using dnr.trim.contigs.good.unique.align as input file for the fasta parameter.

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 2360 17318 13 0 2 133089
25%-tile: 2366 17318 299 0 5 1330883
Median: 2369 17318 301 0 6 2661766
75%-tile: 18996 26501 444 0 26 3992648
97.5%-tile: 0 0 0 0 1 5190442
Maximum: 18996 26501 444 0 26 5323530
Mean: 1742.29 10157.5 169.544 0 2.9191
# of unique seqs: 2999654
total # of seqs: 5323530

Output File Names:
dnr.trim.contigs.good.unique.summary

It took 353 secs to summarize 5323530 sequences.

Using dnr.trim.contigs.good.unique.align as input file for the fasta parameter.

Output File Names:
dnr.filter
dnr.trim.contigs.good.unique.filter.fasta

Using dnr.trim.contigs.good.count_table as input file for the count parameter.
Using dnr.trim.contigs.good.unique.filter.fasta as input file for the fasta parameter.
2999654 1

Output File Names:
dnr.trim.contigs.good.unique.filter.count_table
dnr.trim.contigs.good.unique.filter.unique.fasta

Using dnr.trim.contigs.good.unique.filter.count_table as input file for the count parameter.
Using dnr.trim.contigs.good.unique.filter.unique.fasta as input file for the fasta parameter.

Using 20 processors.

So when I ran it again, it completely cut out all of the sequences in my data set.

Can you try it with 2 or 3 processors?
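
To see how many sequences a given set of screen.seqs cutoffs would keep before running the command, the summary file can be tallied directly. The following is a rough Python sketch, not a mothur feature. The function name `count_passing` is made up for illustration. It assumes the summary file is tab-separated with the columns seqname, start, end, nbases, ambigs, polymer, numSeqs, and that screen.seqs drops a sequence when it starts after `start`, ends before `end`, or has a homopolymer longer than `maxhomop`.

```python
from typing import Iterable, Tuple


def count_passing(lines: Iterable[str], start: int, end: int,
                  maxhomop: int) -> Tuple[int, int]:
    """Return (rows kept, rows total) for the given cutoffs.

    Assumed column order: seqname, start, end, nbases, ambigs, polymer, numSeqs.
    """
    kept = 0
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header line and any malformed or blank lines.
        if len(fields) < 7 or fields[0] == "seqname":
            continue
        seq_start = int(fields[1])
        seq_end = int(fields[2])
        polymer = int(fields[5])
        total += 1
        if seq_start <= start and seq_end >= end and polymer <= maxhomop:
            kept += 1
    return kept, total


# Example with the cutoffs used in this thread:
# with open("dnr.trim.contigs.good.unique.summary") as handle:
#     print(count_passing(handle, start=2368, end=17316, maxhomop=8))
```

If the assumption about `start` holds, then with the median start at 2369 in the summary above, at least half of the sequences begin after start=2368 and would be removed by that cutoff alone. That may be worth checking separately from the processor question.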