cluster.split(processors=32) regenerating large *.temp?

Hello all,

I have come to understand the controversy between V3 and V2 chemistry and its generation of nasty, large distance matrices (mine is 1.8 TB). I've read about the issues with this, but size and memory are not quite the problem, as I'm working on a high-performance cluster. I'm using Illumina paired-end bacterial variable region 4 data, more or less following the mothur MiSeq SOP mixed with this example: https://www.abdn.ac.uk/genomics/documents/Mothor_training_guide.pdf

My problem point is running cluster.split (since cluster is not parallelized):

cluster.split(column=ancil.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.dist,
name=ancil.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.names,
large=T, processors=36)

However, as I've watched the .dist.temp file grow in size, it seems to delete itself and start rebuilding (somewhere around 1 TB it started writing a new .dist.temp). Is mothur working as it should? Other .dist.temp files are being created, but they're only in the kilobyte range. I'm running the most current version, 1.37.5.

I tried my full set of commands in a script with one sample, then two, then about six, all using cluster, and it worked. Now that it's scaled up to 96 samples, I thought it best to switch to cluster.split. Am I missing something, or am I just being impatient?

Thanks for reading!

I'm not sure this is worth trying unless you really have multiple TB worth of RAM. You will get a separate dist temp file for each taxonomic level, so it will keep generating new files as it calculates the distances for each level. Your local sysadmin might also have settings keeping you from generating such large files.
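If you want to avoid the giant matrix entirely, the route the MiSeq SOP takes is to let cluster.split classify first and only compute distances within each taxon, so the full column matrix is never written. A rough sketch using your file names (the fasta and taxonomy file names here are guesses at what your earlier steps produced; adjust to match your own output):

# split by taxonomy at level 4, then cluster each group separately;
# distances are only computed within each taxon, so no single huge .dist file
cluster.split(fasta=ancil.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.fasta,
name=ancil.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.names,
taxonomy=ancil.trim.contigs.good.unique.good.filter.precluster.pick.pick.subsample.taxonomy,
splitmethod=classify, taxlevel=4, cutoff=0.15, processors=36)

Each group's distance file should then stay small enough that you don't need large=T, and the temp files will be per-taxon rather than one monster.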

Pat