missing name after pre.cluster command (v1.25.0)

Hi there,

mothur usually works fine for us.

We analysed our .sff data (reverse primers) with mothur v1.24.1 and it worked fine. With the same input data and the same commands in mothur v1.25.0, we now get an error message after running the pre.cluster command:

sffinfo(sff=in.sff, flow=T)
trim.flows(flow=in.flow, oligos=oligos.txt, bdiffs=1, pdiffs=2, minflows=360, maxflows=720, processors=8)
shhh.flows(file=in.flow.files, processors=8)
trim.seqs(fasta=in.shhh.fasta, name=in.shhh.names, oligos=oligos.txt, flip=T, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, processors=8)
summary.seqs(fasta=in.shhh.trim.fasta, name=in.shhh.trim.names)
unique.seqs(fasta=in.shhh.trim.fasta, name=in.shhh.trim.names)
summary.seqs(fasta=in.shhh.trim.unique.fasta, name=in.shhh.trim.names)
align.seqs(fasta=in.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, processors=8)
summary.seqs(fasta=in.shhh.trim.unique.align, name=in.shhh.trim.names)
screen.seqs(fasta=in.shhh.trim.unique.align, name=in.shhh.trim.names, group=in.shhh.groups, optimize=start-end, criteria=99, minlength=200, processors=8)
summary.seqs(fasta=in.shhh.trim.unique.good.align, name=in.shhh.trim.good.names)
filter.seqs(fasta=in.shhh.trim.unique.good.align, vertical=T, trump=., processors=8)
unique.seqs(fasta=in.shhh.trim.unique.good.filter.fasta, name=in.shhh.trim.good.names)

pre.cluster(fasta=in.shhh.trim.unique.good.filter.unique.fasta, name=in.shhh.trim.unique.good.filter.names, group=in.shhh.good.groups, diffs=2)
missing name HB93FIC05F00A2

missing name HB93FIC05GNOSQ

[ERROR]: Your name file contains 28709 valid sequences, and your groupfile contains 44039, please correct.

All in all, 15330 names seem to be missing from the names file (or should be removed from the group file). Where did we go wrong?
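For anyone who wants to see exactly which names are affected, here is a rough Python sketch of the comparison. It is not part of mothur. It assumes the usual layouts: a names file with a representative name followed by a comma-separated list of all names it stands for, and a group file with a sequence name followed by its group. The function names (read_names, read_groups, missing_from_names) are made up for this example.

```python
def read_names(lines):
    """Collect every sequence name listed in the second column of a names file."""
    names = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        names.update(fields[1].split(","))
    return names


def read_groups(lines):
    """Map each sequence name in a group file to its group."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        groups[fields[0]] = fields[1]
    return groups


def missing_from_names(name_lines, group_lines):
    """Return the names that appear in the group file but not in the names file."""
    names = read_names(name_lines)
    groups = read_groups(group_lines)
    return sorted(name for name in groups if name not in names)
```

Both arguments can be any iterable of lines, so two open file handles work as well as lists of strings. The length of the returned list should match the gap reported in the error message, provided every name in the names file also appears in the group file.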

We already tried to adapt the screen.seqs command, but the error messages still appeared.

Perhaps somebody can help us figure out how to change the commands to avoid these error messages?

Any help is highly appreciated.

Thanks a lot in advance.

Regards, stef

screen.seqs(fasta=in.shhh.trim.unique.align, name=in.shhh.trim.names, group=in.shhh.groups, optimize=start-end, criteria=99, minlength=200, processors=8)

I think what you want is…

screen.seqs(fasta=in.shhh.trim.unique.align, name=in.shhh.trim.unique.names, group=in.shhh.groups, optimize=start-end, criteria=99, minlength=200, processors=8)
in.shhh.trim.unique.names is the names file output by unique.seqs, so that is the one to pass on in place of in.shhh.trim.names.
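To see why the updated names file matters, here is a small Python sketch of the bookkeeping that collapsing identical sequences implies. It only illustrates the idea and is not mothur's implementation. The function dereplicate and its dictionary-based inputs are made up for this example.

```python
def dereplicate(seqs, names):
    """Collapse identical sequences and merge their name lists.

    seqs:  dict mapping representative name -> sequence string
    names: dict mapping representative name -> list of names it stands for
    Returns (new_seqs, new_names) in the same layout.
    """
    first_rep = {}   # sequence string -> representative kept for it
    new_seqs = {}
    new_names = {}
    for rep, seq in seqs.items():
        # A representative without a names entry stands only for itself.
        members = names.get(rep, [rep])
        if seq in first_rep:
            # Duplicate sequence: fold its members into the kept representative.
            new_names[first_rep[seq]].extend(members)
        else:
            first_rep[seq] = rep
            new_seqs[rep] = seq
            new_names[rep] = list(members)
    return new_seqs, new_names
```

After this step the old names mapping no longer matches the collapsed sequences: some representatives are gone and others stand for more reads than before. That is the kind of mismatch you get when a later command is given the names file from before unique.seqs.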

Thank you so much for your reply.

I am currently running the modified script, and everything looks perfect now.

Thanks again, stef

I recently discovered the pre.cluster command, which has greatly reduced the time needed to cluster my set of 10 environmental samples. I am following a protocol very similar to the one in the post above, but I do not use groups at this point; I make groups and a shared file after I finish clustering. I am not sure I understand how pre.cluster groups sequence names from different samples together, and I am afraid I am losing data. Please see below for the number of sequences at each step:

…good.filter.fasta= 249,448

I am concerned that if there is a unique sequence that is shared between samples, my end product does not parse that information back out and I am left with a unique sequence from only one sample.



I suspect you’re not giving the names file to either unique.seqs or pre.cluster.
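As a rough illustration of why the names file is what keeps the per-sample information, here is a short Python sketch. It is not mothur code, and per_group_counts and its inputs are made up for this example. It assumes the names file has been read into a mapping from each representative to all the names it stands for, and the group file into a mapping from each name to its sample.

```python
from collections import Counter


def per_group_counts(names, groups):
    """Count, for each representative sequence, how many of its reads came from each group.

    names:  dict mapping representative name -> list of names it stands for
    groups: dict mapping sequence name -> group (sample) label
    """
    return {
        rep: Counter(groups[member] for member in members)
        for rep, members in names.items()
    }
```

As long as every read's name survives in the names file, a unique sequence shared between samples can still be split back out by sample later. If the names file is left out, only the representative's own name remains, and it belongs to a single sample.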