pre.cluster bug?

I have a 16S pyrotag data set with 2.4 million good reads, 1.1 million of them unique. I ran pre.cluster with the name and group options, then summary.seqs with the name file, and got 0.5M unique and 1.3M total seqs. Shouldn’t the total still be 2.4M?

I’m running v1.24.1 on Ubuntu 11.10

thanks

Can you post all of the commands from running summary.seqs(fasta=, name=) before pre.cluster, pre.cluster(fasta=, name=, group=), and then summary.seqs(fasta=, name=) again?

mothur > summary.seqs(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.names, processors=2)


Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1384 194 0 3 1
2.5%-tile: 1 1384 230 0 3 58851
25%-tile: 1 1384 234 0 4 588508
Median: 1 1384 245 0 4 1177016
75%-tile: 1 1384 260 0 5 1765524
97.5%-tile: 1 1384 275 0 6 2295181
Maximum: 6 1384 408 0 10 2354031
Mean: 1.00832 1358.42 246.122 0 4.38665

# of unique seqs: 1062353

total # of seqs: 2354031

Output File Name:
bac.unique.good.filter.unique.fasta.summary


mothur > pre.cluster(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.names, group=bac.pick.groups, diffs=2, processors=5)

mothur > summary.seqs(fasta=bac.unique.good.filter.unique.precluster.fasta, name=bac.unique.good.filter.unique.precluster.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1384 194 0 3 1
2.5%-tile: 1 1384 230 0 3 34327
25%-tile: 1 1384 234 0 4 343266
Median: 1 1384 246 0 4 686532
75%-tile: 1 1384 260 0 5 1029797
97.5%-tile: 1 1384 275 0 6 1338736
Maximum: 6 1384 408 0 10 1373062
Mean: 1.00777 1399.49 245.643 0 4.40728

# of unique seqs: 534307

total # of seqs: 1373062

Output File Name:
bac.unique.good.filter.unique.precluster.fasta.summary

hi there,
In the second summary.seqs command you used the wrong name file.
cheers

That’s the name file that pre.cluster spit out.
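For anyone who wants to double-check the totals outside of mothur, here is a minimal Python sketch that counts unique and total sequences straight from a names file. It assumes the usual two-column, tab-separated layout (representative name, then a comma-separated list of every read it stands for); the function name `count_names` and the demo data are made up for illustration.

```python
import io

def count_names(handle):
    """Return (unique, total) for a names file read from an open handle.

    Assumes two tab-separated columns: the representative name and a
    comma-separated list of all reads it represents (itself included).
    """
    unique = 0
    total = 0
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        unique += 1
        # ignore empty entries left by a trailing comma
        total += len([m for m in parts[1].split(",") if m])
    return unique, total

# Tiny made-up example: 2 representatives standing for 4 reads
demo = io.StringIO("seqA\tseqA,seqB,seqC\nseqD\tseqD\n")
print(count_names(demo))  # (2, 4)
```

On a real file you would pass `open("bac.unique.good.filter.unique.precluster.names")` instead of the demo handle. If the total here matches what summary.seqs reports, the reads went missing before the names file was written rather than in the summary step.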

I decided to repeat pre.cluster to see if I get the same issue. I’ve tried repeating it 3 times, and each time it crashes (and nearly crashes the whole system) while processing one particular group, so it looks like that could be the problem. But looking at the groups file and the fasta, I can’t figure out what is wrong. The sequences from all the groups are named the same way (no crazy characters), and the group does have sequences in the fasta. What else should I look at to try to find the problem?
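One thing that could be worth a look is how many reads each group has, since the group where pre.cluster hangs may simply be much larger than the rest. The sketch below tallies reads per group from a group file, assuming two tab-separated columns (sequence name, group name); `group_sizes` and the demo data are hypothetical names for illustration only.

```python
import io
from collections import Counter

def group_sizes(handle):
    """Count how many sequence names are assigned to each group.

    Assumes two tab-separated columns: sequence name, group name.
    """
    sizes = Counter()
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        sizes[parts[1]] += 1
    return sizes

# Tiny made-up example: group A has 2 reads, group B has 1
demo = io.StringIO("seq1\tA\nseq2\tA\nseq3\tB\n")
sizes = group_sizes(demo)
for group, n in sizes.most_common():
    print(group, n)
```

Because `most_common()` sorts from largest to smallest, the first line printed is the biggest group, which is the one to compare against the group named in the log just before the crash.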

Command history with these sequences (data from another researcher, should be post quality filtering but they couldn’t provide the qual files for me to check that):

unique.seqs(fasta=bac.fasta)
align.seqs(fasta=bac.unique.fasta, reference=silva.bacteria.fasta, processors=5)
summary.seqs(fasta=bac.unique.align, name=bac.names)
screen.seqs(fasta=bac.unique.align, name=bac.names, group=bac.group, end=5705, start=1046, processors=5)
summary.seqs(fasta=bac.unique.good.align, name=bac.good.names)
filter.seqs(fasta=bac.unique.good.align, vertical=T, trump=., processors=5)
unique.seqs(fasta=bac.unique.good.filter.fasta, name=bac.names)
list.seqs(fasta=bac.unique.good.filter.unique.fasta)
get.seqs(accnos=current, group=bac.groups, name=bac.unique.good.filter.names)
pre.cluster(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.names, group=bac.pick.groups, diffs=2, processors=5)

You have a few file name mismatches.

unique.seqs(fasta=bac.fasta)
align.seqs(fasta=bac.unique.fasta, reference=silva.bacteria.fasta, processors=5)
summary.seqs(fasta=bac.unique.align, name=bac.names)
screen.seqs(fasta=bac.unique.align, name=bac.names, group=bac.group, end=5705, start=1046, processors=5)
summary.seqs(fasta=bac.unique.good.align, name=bac.good.names)
filter.seqs(fasta=bac.unique.good.align, vertical=T, trump=., processors=5)
unique.seqs(fasta=bac.unique.good.filter.fasta, name=bac.good.names)
list.seqs(fasta=bac.unique.good.filter.unique.fasta)
get.seqs(accnos=current, group=bac.good.groups, name=bac.unique.good.filter.names)
pre.cluster(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.pick.names, group=bac.good.pick.groups, diffs=2, processors=5)

It may be easier to use the current option, so you avoid file name mismatches. Mothur will output the name of the file it uses with each command.

unique.seqs(fasta=bac.fasta)
align.seqs(fasta=current, reference=silva.bacteria.fasta, processors=5)
summary.seqs(fasta=current, name=current)
screen.seqs(fasta=current, name=current, group=bac.group, end=5705, start=1046, processors=5)
summary.seqs(fasta=current, name=current)
filter.seqs(fasta=current, vertical=T, trump=., processors=5)
unique.seqs(fasta=current, name=current)
list.seqs(fasta=current)
get.seqs(accnos=current, group=current, name=current)
pre.cluster(fasta=current, name=current, group=current, diffs=2, processors=5)
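If you want to confirm that a names file and a group file actually belong together, a rough check along these lines may help. It is only a sketch under the assumption that both files are tab-separated (names file: representative, then comma-separated reads; group file: read, then group); `names_without_group` is a made-up helper, not a mothur command.

```python
import io

def names_without_group(names_handle, group_handle):
    """List reads that appear in the names file but not in the group file."""
    grouped = set()
    for line in group_handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            grouped.add(parts[0])

    missing = []
    for line in names_handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        for name in parts[1].split(","):
            if name and name not in grouped:
                missing.append(name)
    return missing

# Tiny made-up example: seqB is in the names file but has no group
names = io.StringIO("seqA\tseqA,seqB\nseqC\tseqC\n")
groups = io.StringIO("seqA\tG1\nseqC\tG2\n")
print(names_without_group(names, groups))  # ['seqB']
```

An empty list means every read in the names file has a group. Note that the set of read names is held in memory, so with very long sequence names and millions of reads this needs a fair amount of RAM.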

I’d forgotten about the current option. I reran everything up to pre.cluster, but it still crashes after reading in the first few groups. It also crashes Firefox and Nautilus when it gets hung up on pre.cluster.

Could you send your bac.fasta, bac.group and logfile to mothur.bugs@gmail.com, and I will try to track down the problem?

I’d love to have your help, but the files are huge: fasta 1.6 GB, names 560 MB, groups 400 MB. The way the sequences have been named is a bit ridiculous (the whole RDP taxonomic string is part of the sequence name), which is why those 2 files are so huge. Is there any non-Gmail way to get them to you? Or maybe a short list of things that you would look for to begin with?

Update for anyone else having this type of problem: Westcott suggested that I might not have enough RAM (12 GB) to run this file on 4 processors. She was correct; running pre.cluster on a single processor worked great.