Enormous dist file

A concern for a run I’m doing now: I have a set of samples (sputum from people with cystic fibrosis, and from normal people, and reagent controls): 23 samples in all. I have all the fastq R1 and R2 files from a MiSeq run, etc. I’m running Mothur 1.34.3 but have the same issue in 1.36. I’m using the MiSeq SOP and have a batch file that I know works from that. I’m using Silva v119.

The issue: with these 23 samples, I have a 1.03 TB dist file (stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist) that took 275,445 seconds, or about 76.5 hours, to generate. I’ve simply not encountered this in previous Mothur runs even when I had considerably more samples. For example, for a paper I have in review right now that had 58 samples, using Mothur 1.34.1 earlier this year, I had a 27.7 GB dist file.

This seems to be a consistent issue, as I’ve re-run my code with the 23 samples and have the same file size.

I tried to do cluster.split instead and had an error: “‘HWI-M20149_246_000000000-AHBRF_1_1101_15057_5312’ is not in your name or count file, please correct.” If I go into the count.table and put that in (copying the code for another line, substituting the above name and giving it a total count of 1), I get an error for another, different OTU.

I’m now using the dist file generated above to do a cluster (next step in the SOP). As one can imagine, it’s taking a while.

So I think there’s something wrong but don’t know what it is. Any thoughts?


Steve White

The file mismatch error is likely due to leaving the name or count file off on a previous command. Did you forget to include it with one of the get / remove.seqs commands? The size of your distance matrix is very large. Have you read Pat’s blog on the issue, http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/?
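One way to track down which step lost sync is to compare the sequence names in the fasta file against the first column of the count file after each command. The Python below is only a rough sketch of that check, not part of mothur. It assumes a plain tab-delimited count table whose first non-comment line is a column header, and the function names (fasta_names, count_table_names, missing_from_count) are made up for illustration.

```python
def fasta_names(fasta_lines):
    """Return the set of sequence names found on FASTA header lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            # The name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def count_table_names(count_lines):
    """Return the set of names in the first column of a count table.

    Assumption: tab-delimited text, one header row, optional '#' comments.
    """
    names = set()
    header_seen = False
    for line in count_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if not header_seen:
            header_seen = True  # first real line is the column header
            continue
        names.add(line.split("\t")[0])
    return names


def missing_from_count(fasta_lines, count_lines):
    """Names present in the fasta data but absent from the count table."""
    return sorted(fasta_names(fasta_lines) - count_table_names(count_lines))


if __name__ == "__main__":
    # Tiny in-memory example; with real files, pass open file handles instead.
    fasta = [">seqA\n", "ACGT\n", ">seqB\n", "ACGA\n"]
    counts = ["Representative_Sequence\ttotal\n", "seqA\t5\n"]
    print(missing_from_count(fasta, counts))
```

If the list is non-empty right after a particular get.seqs or remove.seqs step, that step is the likely place where the name or count file was left off. Re-running from there is safer than hand-editing the count table.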

Thanks for the tip on Pat’s blog post. I may or may not be in that situation (that covers all the possibilities).

I spoke to the senior tech at Argonne who processed my samples. I had 60 samples total (2 different experiments). They get 15 - 19 million reads per sequencing run on their MiSeq platform that then is divided among all the samples in that run. My PAST samples from Argonne generally had 10-40K sequences each because they were put in very large batches.

But in this run, I had 60 samples that were run separately at my insistence because I had a grant deadline and couldn’t wait. The samples in this experiment are sputum samples from two groups of patients: cystic fibrosis and normal subjects. I had a few reagent controls as well. I had two experiments and am processing them separately. For this one, I have 23 specimens: the counts just after make.contigs –

CF sputum samples – 130K to 340K total seqs per sample
Normal sputum samples – 150K to 360K total seqs per sample
Reagent controls – <1K to 15K seqs per sample

In contrast, a manuscript that is in review now from my lab, with deep lung samples, has 10-40K seqs per sample. That was done as part of a very large sequencing run with a few hundred samples, so of course each sample has fewer reads.

I double-checked my batch file just now, which is right out of the latest version of the MiSeq SOP. I don’t see any missing count tables or name files.

By the time I get past chimera checking, per summary.seqs I have 327K uniques and 3.96M total seqs. I think this is why dist.seqs is gagging.
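To put a number on why the unique count matters: the number of possible pairwise comparisons grows with the square of the number of unique sequences. Here is a quick back-of-the-envelope sketch in Python (the function name pairwise_count is just for illustration). It gives an upper bound only, since dist.seqs with a cutoff writes only the pairs under that cutoff, and it says nothing about bytes per entry on disk.

```python
def pairwise_count(n_unique):
    """Number of distinct sequence pairs among n_unique sequences: n(n-1)/2."""
    return n_unique * (n_unique - 1) // 2


if __name__ == "__main__":
    n = 327_000  # unique sequences reported by summary.seqs in this thread
    print(f"{pairwise_count(n):,} possible pairwise distances")
    # Halving the number of uniques cuts the pair count to roughly a quarter.
    print(f"{pairwise_count(n // 2):,} possible pairs at half as many uniques")
```

So anything that lowers the number of uniques before dist.seqs pays off quadratically in the size of the distance file.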

As to Pat’s concern as stated in his blog, Argonne does 2 x 150 bp reads. My person there is pretty darned adamant that for V4 (only), with an amplicon of 253 bp, this is sufficient for proper overlap. She thinks a 2 x 250 bp read is overkill and wasteful. She and Pat may need to agree to disagree!

Another potential issue: I’m running all this on an iMac Core i7 with 24 GB memory. Might not be enough muscle. I don’t know how to run this on a server, but perhaps some advice on this and I can get our informatics people to set something up for me?

Any other thoughts? Many thanks in advance and sorry for the length of the response.

Just an update: I ran cluster.split (got it to work) using the file option. I no longer have a 1 TB dist file, but I have a bunch of smaller dist files, the largest of which is ~ 240GB. I’m running it now using the ‘file’ that was generated. Does anyone have a sense of how big a dist file can be in this option before Mothur fails?

The size of the distance file mothur can process is directly related to the amount of memory you have. With the cluster.split command, if you are running the command with processors=1, mothur will need more memory than your largest distance file.
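As a rough illustration of that rule of thumb, the check below compares the largest split distance file against available RAM. It is a Python sketch, not anything mothur runs itself. The 240 GB and 24 GB figures come from this thread, the 10 GB file is a made-up placeholder, and the function name fits_in_memory is hypothetical. Passing the check does not guarantee the run will succeed; failing it means trouble is very likely with processors=1.

```python
GB = 10**9  # decimal gigabytes, good enough for a rough comparison


def fits_in_memory(dist_sizes_bytes, ram_bytes):
    """True if RAM exceeds the largest distance file (processors=1 rule of thumb)."""
    if not dist_sizes_bytes:
        return True  # nothing to load
    return max(dist_sizes_bytes) < ram_bytes


if __name__ == "__main__":
    # Largest split file in this thread is ~240 GB; the other size is a placeholder.
    split_files = [240 * GB, 10 * GB]
    ram = 24 * GB  # memory on the iMac described above
    print("largest file fits in RAM:", fits_in_memory(split_files, ram))
```

With real files, os.path.getsize on each of the split .dist files would supply the sizes.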

Thanks for that. I’m running all this from an iMac with a core i7 processor, so 1 processor is all I have. If I put in a number other than 1 it doesn’t matter, as I recall.

I’m going to look into getting our university informatics team set up something on a server to let me run Mothur with some more muscle for this.