mothur

How do I specify a path on an HPC for where to store the dist file?

I am working on an HPC and running the cluster.split command on my data. It is generating a number of files, and my .dist file (~1.3 TB and counting) is causing issues for other jobs running on the same server node.

The Advanced Research Computing staff have asked whether I can supply a path, either as an input argument to mothur or in the script, for where to write that dist file. They suggested that we could give mothur a path on the file system where the dist file should be stored (or at least use a symbolic link). Is that possible?

You can use the inputdir and outputdir options with all of the commands to direct input and output.
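For example, a batch invocation might look like the sketch below. You can set the directories once with set.dir, or pass outputdir directly to an individual command; the /scratch path and the file names here are placeholders for your own:

```
mothur > set.dir(input=/home/username/data, output=/scratch/username/mothur_out)
mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cutoff=0.03)

# equivalently, per command:
mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cutoff=0.03, outputdir=/scratch/username/mothur_out)
```

Pointing outputdir at a large scratch file system (or a directory that is a symbolic link to one) keeps the distance matrix off the shared node storage.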

That being said, if your distance matrix is that large, it’s unlikely you will be successful in getting it to cluster. Do you have non-V4 data? You might want to see this…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/

Pat

@pschloss I have actually read that blog article many times. This data set is 2 x 250 bp reads generated with the 341f/806r primer set. We also include mock communities with all of our sequencing runs so we can track the error rates. The data set is rather large (over 400 samples), and I have wondered whether that may be the real reason our distance matrix is so big; we don't seem to have this issue when we run 100-200 samples or fewer. I'm curious whether your group has ever had computing issues with larger data sets and whether there are any suggested work-arounds, since we obviously want to run all of our samples together through the whole process for downstream analysis and interpretation. Or perhaps this is an artifact of the paired 250 bp reads combined with the larger data set? What would your recommended recourse be if you couldn't resequence?

We’ve clustered hundreds/thousands of V4 samples without these types of problems. I suspect the problem is that with your primer set the reads won’t fully overlap like they do with the V4 primers. I think you’re sequencing the V3-V4 region with those primers, which will only have about 75 nt of overlap.
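As a rough back-of-the-envelope check of the overlap point, here is the arithmetic. The ~425 nt trimmed V3-V4 amplicon length is an approximation for the 341f/806r product, not a value measured from this data set:

```python
# Rough overlap estimate for paired-end reads (all lengths approximate).
READ_LEN = 250        # 2 x 250 bp MiSeq run
V4_AMPLICON = 250     # ~250 nt V4 region: reads fully overlap
V34_AMPLICON = 425    # ~425 nt V3-V4 region after primer removal (assumption)

def overlap(read_len: int, amplicon_len: int) -> int:
    """Number of bases covered by both reads of a pair."""
    return 2 * read_len - amplicon_len

print(overlap(READ_LEN, V4_AMPLICON))   # 250: every base read twice for V4
print(overlap(READ_LEN, V34_AMPLICON))  # 75: only ~75 nt of overlap for V3-V4
```

With full double coverage, sequencing errors in one read can be corrected by the other during contig assembly; with only ~75 nt of overlap, most of each contig is single-read quality, which inflates the number of unique sequences and, in turn, the distance matrix.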

Pat

@pschloss Wishful thinking on my part, then. I'm really curious whether anyone with this kind of data has been able to resolve the issue by just allocating more memory, or whether the matrix will continue to balloon in size. We're in a bit of a conundrum here because we would like to see some part of the data to help inform a follow-up study in which we could rectify the sequencing-choice error from the previous study.

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.