pre.cluster bug?

Kendra · May 30, 2012, 9:08pm

I have a 16S pyrotag data set with 2.4 million good reads, 1.1 million unique. I ran pre.cluster using the name and group options, then summary.seqs with the name file, and got 0.5M unique and 1.3M total seqs. Shouldn’t the total still be 2.4M?

I’m running v1.24.1 on Ubuntu 11.10

thanks

pschloss · May 31, 2012, 12:35pm

Can you post all of the commands from running summary.seqs(fasta=, name=) before pre.cluster, pre.cluster(fasta=, name=, group=), and then summary.seqs(fasta=, name=) again?

Kendra · May 31, 2012, 5:39pm

mothur > summary.seqs(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.names, processors=2)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1384 194 0 3 1
2.5%-tile: 1 1384 230 0 3 58851
25%-tile: 1 1384 234 0 4 588508
Median: 1 1384 245 0 4 1177016
75%-tile: 1 1384 260 0 5 1765524
97.5%-tile: 1 1384 275 0 6 2295181
Maximum: 6 1384 408 0 10 2354031
Mean: 1.00832 1358.42 246.122 0 4.38665
# of unique seqs: 1062353
total # of seqs: 2354031

Output File Name:
bac.unique.good.filter.unique.fasta.summary

mothur > pre.cluster(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.names, group=bac.pick.groups, diffs=2, processors=5)

mothur > summary.seqs(fasta=bac.unique.good.filter.unique.precluster.fasta, name=bac.unique.good.filter.unique.precluster.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1384 194 0 3 1
2.5%-tile: 1 1384 230 0 3 34327
25%-tile: 1 1384 234 0 4 343266
Median: 1 1384 246 0 4 686532
75%-tile: 1 1384 260 0 5 1029797
97.5%-tile: 1 1384 275 0 6 1338736
Maximum: 6 1384 408 0 10 1373062
Mean: 1.00777 1399.49 245.643 0 4.40728

# of unique seqs: 534307
total # of seqs: 1373062

Output File Name:
bac.unique.good.filter.unique.precluster.fasta.summary

Strejda · June 1, 2012, 9:54am

Hi there,
In the second summary.seqs command you used the wrong name file.
Cheers

Kendra · June 1, 2012, 6:00pm

That’s the name file that precluster spit out
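One way to check whether a given name file accounts for every read is to count its entries directly. The following is a minimal sketch, not part of mothur: it assumes the usual two-column, tab-separated name file layout (representative sequence name, then a comma-separated list of every sequence it represents), and `count_names` is a hypothetical helper name. The sequence names in the demo are made up.

```python
import io

def count_names(handle):
    """Return (unique, total) for a mothur-style name file.

    Assumes each non-empty line is: representative<TAB>name1,name2,...
    """
    unique = 0
    total = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected two tab-separated columns: " + line)
        unique += 1
        # Count every sequence name listed for this representative.
        total += len([name for name in fields[1].split(",") if name])
    return unique, total

# Tiny made-up example: two representatives covering four reads.
demo = io.StringIO("seqA\tseqA,seqB,seqC\nseqD\tseqD\n")
print(count_names(demo))
```

For a real file, pass an open file handle instead of the `io.StringIO` demo. If the total differs between the name file given to pre.cluster and the one it writes out, reads were lost somewhere during the run rather than merely merged.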

Kendra · June 1, 2012, 10:27pm

I decided to repeat pre.cluster to see if I get the same issue. I’ve tried repeating it 3 times and each time it crashes (and nearly crashes the whole system) when it’s processing one particular group, so it looks like that could be the problem. But looking at the groups file and the fasta, I can’t figure out what the problem is. The sequences from all the groups are named the same way (no crazy characters) and the group does have sequences in the fasta. What else should I look at to try to find the problem?

Command history with these sequences (data from another researcher, should be post quality filtering but they couldn’t provide the qual files for me to check that):

unique.seqs(fasta=bac.fasta)
align.seqs(fasta=bac.unique.fasta, reference=silva.bacteria.fasta, processors=5)
summary.seqs(fasta=bac.unique.align, name=bac.names)
screen.seqs(fasta=bac.unique.align, name=bac.names, group=bac.group, end=5705, start=1046, processors=5)
summary.seqs(fasta=bac.unique.good.align, name=bac.good.names)
filter.seqs(fasta=bac.unique.good.align, vertical=T, trump=., processors=5)
unique.seqs(fasta=bac.unique.good.filter.fasta, name=bac.names)
list.seqs(fasta=bac.unique.good.filter.unique.fasta)
get.seqs(accnos=current, group=bac.groups, name=bac.unique.good.filter.names)
pre.cluster(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.names, group=bac.pick.groups, diffs=2, processors=5)

westcott · June 4, 2012, 5:01pm

You have a few file name mismatches.

unique.seqs(fasta=bac.fasta)
align.seqs(fasta=bac.unique.fasta, reference=silva.bacteria.fasta, processors=5)
summary.seqs(fasta=bac.unique.align, name=bac.names)
screen.seqs(fasta=bac.unique.align, name=bac.names, group=bac.group, end=5705, start=1046, processors=5)
summary.seqs(fasta=bac.unique.good.align, name=bac.good.names)
filter.seqs(fasta=bac.unique.good.align, vertical=T, trump=., processors=5)
unique.seqs(fasta=bac.unique.good.filter.fasta, name=bac.good.names)
list.seqs(fasta=bac.unique.good.filter.unique.fasta)
get.seqs(accnos=current, group=bac.good.groups, name=bac.unique.good.filter.names)
pre.cluster(fasta=bac.unique.good.filter.unique.fasta, name=bac.unique.good.filter.pick.names, group=bac.good.pick.groups, diffs=2, processors=5)

It may be easier to use the current option, so you avoid file name mismatches. Mothur will output the name of the file it uses with each command.

unique.seqs(fasta=bac.fasta)
align.seqs(fasta=current, reference=silva.bacteria.fasta, processors=5)
summary.seqs(fasta=current, name=current)
screen.seqs(fasta=current, name=current, group=bac.group, end=5705, start=1046, processors=5)
summary.seqs(fasta=current, name=current)
filter.seqs(fasta=current, vertical=T, trump=., processors=5)
unique.seqs(fasta=current, name=current)
list.seqs(fasta=current)
get.seqs(accnos=current, group=current, name=current)
pre.cluster(fasta=current, name=current, group=current, diffs=2, processors=5)
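A file name mismatch of this kind usually shows up as sequence names that appear in one file but not the other. The sketch below is one way to check that outside mothur. It assumes a tab-separated name file (representative, then a comma-separated member list) and a tab-separated group file (sequence name, then group); the function names are hypothetical and the demo data is made up.

```python
import io

def names_from_name_file(handle):
    """Collect every sequence name listed in column 2 of a name file."""
    names = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            names.update(n for n in fields[1].split(",") if n)
    return names

def names_from_group_file(handle):
    """Collect every sequence name listed in column 1 of a group file."""
    names = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            names.add(fields[0])
    return names

def compare(name_handle, group_handle):
    """Return (missing_from_groups, missing_from_names) as sets."""
    in_names = names_from_name_file(name_handle)
    in_groups = names_from_group_file(group_handle)
    return in_names - in_groups, in_groups - in_names

# Made-up demo: seqC has no group, seqD has no name file entry.
name_demo = io.StringIO("seqA\tseqA,seqB\nseqC\tseqC\n")
group_demo = io.StringIO("seqA\tsoil1\nseqB\tsoil1\nseqD\tsoil2\n")
missing_from_groups, missing_from_names = compare(name_demo, group_demo)
print(sorted(missing_from_groups), sorted(missing_from_names))
```

Two empty sets mean the name and group files describe the same reads. Note that this holds every sequence name in memory, so on files of several hundred megabytes it needs a correspondingly large amount of RAM.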

Kendra · June 4, 2012, 6:32pm

I’d forgotten about the current option. I reran everything up to pre.cluster, but it still crashes after reading in the first few groups. It also crashes Firefox and Nautilus when it gets hung up on pre.cluster?

westcott · June 4, 2012, 7:52pm

Could you send your bac.fasta, bac.group and logfile to mothur.bugs@gmail.com, and I will try to track down the problem?

Kendra · June 4, 2012, 8:21pm

I’d love to have your help but the files are huge: fasta 1.6 GB, names 560 MB, groups 400 MB. The way the sequences have been named is a bit ridiculous (the whole RDP taxonomic string as part of the sequence name), which is why those 2 files are so huge. Is there any non-Gmail way to get them to you? Or maybe a short list of things that you would look for to begin with?

Kendra · June 12, 2012, 10:23pm

Update for anyone else having this type of problem: Westcott suggested that I might not have enough RAM (12 GB) to run this file on 4 processors. She is correct; running pre.cluster on a single processor worked great.
