sens.spec call as part of new cluster.split in 1.39 causing crash

Hi Folks,
Have been re-running a v4 dataset (77 samples, ~150K reads/sample) through 1.39 to benchmark against 1.38.1.
Overall, I’m really, really impressed with the speedup (thanks Pat and Sarah!!!), but it seems like cluster.split is calling sens.spec after it finishes clustering and merging the clustered files, and this seems to make me run out of RAM (128GB) and crash.
These data ran fine on this server in 1.38.1, so I don’t think this is an issue with oversized distance matrices...
I’m not seeing much in the documentation about what sens.spec does—looks like it’s scoring the quality of the OTU calls?
Thanks for looking into this, and for all you do for the microbial ecology community!

-Adam Mumford

last bit of the log before it goes down:

It took 4593 seconds to cluster
Merging the clustered files…
It took 14 seconds to merge.
/******************************************/
Running command: sens.spec(list=PASFUOG_Spring_2016.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.opti_mcc.unique_list.list, column=PASFUOG_Spring_2016.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=PASFUOG_Spring_2016.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table)

NOTE: sens.spec assumes that only unique sequences were used to generate the distance matrix.

Glad you like it! Sarah will follow up with you, but you might try it without doing cluster.split. In our testing (see the preprint), it wasn’t necessarily faster and the output was a little worse than using the normal cluster command.

Pat

Holy $^&! that’s fast.
Looks like it works just fine running from cluster instead of cluster.split. I was about to complain about it using only one processor—but then it finished before I could compose any sort of question.
At the risk of derailing the conversation and getting into the weeds: it looked like it lit off vsearch during the call to cluster without being specifically asked. Is OptiClust using one of the vsearch algorithms in some way that’s not clear to me from the preprint?
Thanks!
-Adam


from the log: mothur > cluster(fasta=PASFUOG_Spring_2016.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=PASFUOG_Spring_2016.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table, cutoff=0.15, processors=24)

Using 24 processors.
[NOTE]: Default clustering method has changed to opti. To use average neighbor, set method=average.
[WARNING]: You can only use the processors option when using the agc or dgc clustering methods. Using 1 processor.
./home/amumford/mothur-1.37.5/mothurvsearch file does not exist. Checking path…
Found vsearch in your path, using /usr/local/bin/vsearch
It took 509 seconds to cluster

Hmmm, no it’s not using vsearch. We’ll check on that error message.

Quick update—
I seem to have gotten it to run by going from 24 to 20 processors—maybe that took the load off the RAM and let it finish?
Also, cluster appears to launch vsearch when it’s not explicitly given a distance matrix to start; otherwise it’ll run the new opti.
I’m wondering—is there any speed/memory benefit to running cluster.split to get the distance matrix rather than running dist.seqs?
Thanks for all the work!
-Adam

We released version 1.39.1: https://github.com/mothur/mothur/releases/tag/v1.39.1. It includes a parameter, runsensspec, that lets you indicate whether you want to run the sens.spec command on the completed list file. You can set runsensspec=F to skip this step. For the vsearch question, could you post the cluster.split command you ran and the output?
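
A minimal sketch of a cluster.split call with the sens.spec step turned off. The file names, taxlevel, and cutoff are placeholders assumed for illustration, not taken from the run in this thread; only runsensspec=F comes from the release described above.

```
# Placeholder inputs: data.fasta, data.count_table, data.taxonomy (assumed names)
# runsensspec=F skips the sens.spec call on the completed list file (1.39.1 and later)
cluster.split(fasta=data.fasta, count=data.count_table, taxonomy=data.taxonomy, taxlevel=4, cutoff=0.03, runsensspec=F)
```

If the sens.spec statistics are still wanted, they can presumably be computed afterwards by running sens.spec on the finished list file as a separate step.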

Sorry for the delay in getting back to you on this... here’s where it called ‘vsearch’ when asked for ‘opti’. I’m realizing now that I needed to run dist.seqs first if I wanted opti to have something to cluster…
Thanks for giving us the option to turn off sens.spec; that seems to help.
Cheers,
-Adam
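
For anyone landing here later, the sequence described above (dist.seqs first, then cluster on the resulting distance file) might look roughly like this. It is a sketch with placeholder file names and an assumed cutoff of 0.03, not the exact commands from this run.

```
# Step 1: build a column-format distance matrix from the aligned, unique sequences
# (data.fasta is a placeholder name; dist.seqs writes data.dist by default)
dist.seqs(fasta=data.fasta, cutoff=0.03)

# Step 2: give cluster the distance matrix explicitly so opti has distances to work on
cluster(column=data.dist, count=data.count_table, method=opti, cutoff=0.03)
```

Passing column= rather than fasta= is the difference from the command in the log below, which was given only a fasta file and ended up locating vsearch.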


mothur > cluster(fasta=data.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=data.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.count_table, method = opti, cutoff=0.15, processors=24)

Using 24 processors.
[WARNING]: You can only use the processors option when using the agc or dgc clustering methods. Using 1 processor.
./home/amumford/mothur-1.37.5/mothurvsearch file does not exist. Checking path…
Found vsearch in your path, using /usr/local/bin/vsearch
It took 507 seconds to cluster

Output File Names:
data.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.opti.unique_list.list