mothur

How do I specify a path on an HPC for where to store the dist file?

I am working on an HPC and running the cluster.split command on my data. It is generating a number of files, and my .dist file (~1.3 TB and counting) is causing issues for other jobs running on the same server node.

The Advanced Research Computing staff have asked whether I can supply a path, either as an input argument to mothur or in the script, for where to write that dist file. They suggested that we could give mothur a path on the file system where the dist file should be stored (or at least use a symbolic link). Is that possible?

You can use the inputdir and outputdir options with all of the commands to direct input and output.
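For example, a batch invocation might look like the sketch below. You can set the directories once with set.dir, or pass outputdir directly to an individual command; the /scratch path and the file names here are placeholders for your own:

```
mothur > set.dir(input=/home/username/data, output=/scratch/username/mothur_out)
mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cutoff=0.03)

# equivalently, per command:
mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cutoff=0.03, outputdir=/scratch/username/mothur_out)
```

Pointing outputdir at a large scratch file system (or a directory that is a symbolic link to one) keeps the distance matrix off the shared node storage.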

That being said, if your distance matrix is that large, it’s unlikely you will be successful in getting it to cluster. Do you have non-V4 data? You might want to see this…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/

Pat

@pschloss I have actually read that blog article many times. This data set is 2 x 250 bp reads generated with the 341f/806r primer set. We also include mock communities with all of our sequencing runs so we can track the error rates. The data set is rather large (over 400 samples), and I have wondered whether that may be the real reason our distance matrix is so big; we don't seem to have this issue when we run 100-200 samples or fewer. I'm curious whether your group has ever had computing issues with larger data sets and whether there are any suggested work-arounds, since we obviously want to run all of our samples together through the whole process for downstream analysis and interpretation. Or perhaps this is an artifact of the paired 250 bp reads combined with the larger data set? What would your recommended recourse be if you couldn't resequence?

We’ve clustered hundreds/thousands of V4 samples without these types of problems. I suspect the problem is that with your primer set the reads won’t fully overlap like they do with the V4 primers. I think you’re sequencing the V3-V4 region with those primers, which will only have about 75 nt of overlap.
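As a rough back-of-the-envelope check of the overlap point, here is the arithmetic. The ~425 nt trimmed V3-V4 amplicon length is an approximation for the 341f/806r product, not a value measured from this data set:

```python
# Rough overlap estimate for paired-end reads (all lengths approximate).
READ_LEN = 250        # 2 x 250 bp MiSeq run
V4_AMPLICON = 250     # ~250 nt V4 region: reads fully overlap
V34_AMPLICON = 425    # ~425 nt V3-V4 region after primer removal (assumption)

def overlap(read_len: int, amplicon_len: int) -> int:
    """Number of bases covered by both reads of a pair."""
    return 2 * read_len - amplicon_len

print(overlap(READ_LEN, V4_AMPLICON))   # 250: every base read twice for V4
print(overlap(READ_LEN, V34_AMPLICON))  # 75: only ~75 nt of overlap for V3-V4
```

With full double coverage, sequencing errors in one read can be corrected by the other during contig assembly; with only ~75 nt of overlap, most of each contig is single-read quality, which inflates the number of unique sequences and, in turn, the distance matrix.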

Pat

@pschloss Wishful thinking on my part, then. I'm really curious whether anyone with this kind of data has been able to resolve the issue by just allocating more memory, or whether the matrix will continue to balloon in size. We're in a bit of a conundrum here because we would like to see some part of the data to help inform a follow-up study in which we could rectify the sequencing-choice error from the previous study.

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.