Problems handling a >50 Gb distance matrix (cluster command)

Dear Pat Schloss and mothur community,

I’ve been using mothur since mid 2009’s in several different contexts (454, cloning libraries, functional genes, etc.), and it always worked fine. Nevertheless, our recent project on marine sediments gave us a bunch of new 454 sequences that we found many problems to deal with - in a few words the phyllip distance matrix has 51.3 Gb and our lab computer has only 16 Gb RAM, and therefore the cluster command can’t calculate OTUs. I am still not sure if it is a mothur problem, a hardware problem, or both (I guess it is hardware).

In details:

  • Computer: Dell XPS 8700, Intel i7-3770 CPU @ 3.40 GHz x 8, 16 Gb RAM, Linux Ubuntu release 12.04 (precise) 64-bit.

  • Mothur used: v.1.30.2 (4/19/2013)

  • 454 GS FLX Pyrosequencing was used to sequence 16S rRNA (V4-V6) in a multiplex system with different libraries. A total of 15 libraries belong to the same project on marine sediment, they represent 5 different sampling points with 3 replicates each (5 samples X 3 replicates = 15 libraries)

  • After trimming (qual file used, barcode mismatch =1, primer mismatch=2, only Phred score > 20 accepted, minimum length 460 and maximum length 570), and quimera checking (Chimera Slayer algorithm), the total number of high quality sequences is 117,267 reads with average read length of 495 bp.

  • Alignment done on mothur using Silva database, and filtered for vertical gaps with filter.seqs command. The aligned file has 200 Mb.

-The distance matrix generated is a phyllip formated table with size 51.3 Gb. This matrix was build with cutoff=0.20

  • Clustering: the OTUs were clustered using cluster(phylip=isobatas.filter.phylip.dist), or with cutoff=0.20, and both cases it crashed after processing for several days. We always left a System Monitor window oppened so we can check if the system is running or not. Initially, mothur starts by reading the distance matrix which in our case it used 99% of RAM (15.6 Gb RAM) and also 18 Gb of swap memory, working with only 1 CPU at 100%. After 24 hours the results for unique cutoff are printed in the terminal. After 4-5 days, the CPU usage drops to 0-1% but the used memory remains the same. At this point, we became bored and stopped mothur, or we get the following message:

[many zeroes above, which are results from the unique cutoff]

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
[ERROR]: std::bad_alloc has occurred in the ClusterClassic class function getSmallCell. This error indicates your computer is running out of memory. This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are running our 32bit version, your memory usage is limited to 4G. If you have more than 4G of RAM and are running a 64bit OS, using our 64bit version may resolve your issue. If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, > http://www.mothur.org/wiki/Schloss_SOP> . If you are uable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com> , and be sure to include the mothur.logFile with your inquiry.%

Question 1: Big file size for a distance matrix: using the head command from linux to print the first 100 lines of that 51.3 Gb distance matrix file, we can see that mothur still saved many distances >0.20 , is this a bug?
Command example: head -n100 <distance_matrix_file.dist>

Question 2: Is the distance matrix supposed to be so big like this, or I am doing something wrong? I think the distance file size is independent of read length, but it is very dependent of number of reads. Is 117267 sequences too much?

[b]Question 3: I didn’t tried to run the clustering in a better computer, but we can try it in a super cluster computer from our institute. Before that I just want to know if anyone here has experienced similar problems with >50 Gb distance matrices. Anyone?

Question 4: Regarding the error message above, our computer has more than 4 Gb of RAM and is running a 64-bit system. We are using only 1 processor in the cluster command, because it doesn’t has any option to choose how many processors to use. It seems the error happened because there was not enough RAM memory to read the distance matrix, but interestingly if you add the used RAM and Swap memory the total size is less than the matrix (15.6 + 18 = 33.6 Gb). Where is the remaining 17.7 Gb of the matrix? Did mothur crashed because it couldn’t read it all?[/b]

Does anyone here found similar problems? I appreciate any comments.

Oi Rubens, Hi all,

I am facing similar problems. I will describe my problem in this reply.

@Rubens: I will give you my opinion in the following reply (although I am far from being a mothur expert.

Background information about my hardware:

Memory:251.8 GiB
Processor: Intel® Xeon® CPU E5-2690 0 @ 2.90GHz × 18
OS: Ubuntu Release 12.04 (precise) 64-bit

I am currently trying to cluster a column-formatted distance matrix of 63 Gb. It contains information of 1.3 M unique seqs.

First, I tried using clustering command:
cluster(column=myseqs.trim.unique.good.filter.unique.pick.dist, count=myseqs.trim.unique.good.filter.uchime.pick.count_table, cutoff=0.2)
It happened similarly what happened with Rubens. The cluster algorithm uses only one processor at time and got stuck for several days at memory usage 44 Gb. Even though I had more free memory.
I guess ,in my case, 44Gb of memory was the maximum memory that a single processor could access, therefore the process got stuck.

My current strategy is now using cluster.split():
cluster.split(column=myseqs.trim.unique.good.filter.unique.pick.dist, count=myseqs.trim.unique.good.filter.uchime.pick.count_table, processors=30)

After 5 days, it is still using one processor and currently splitting the file:

“Using 30 processors.
Using splitmethod distance.
Splitting the file…”

The memory usage is increasing slowly, so it seems it is still running. It did not reached the 44 Gb RAM usage yet.

I have 3 question for the mothur community:

  1. Is this a reasonable time for this stage of the script: 5 days to split the file in 30?

  2. Is there a best/faster way to cluster my sequences? In other words, should I change my commands/parameters?

  3. Would I be better using cluster split command based on taxonomy? For example:
    cluster.split(fasta=, count=, taxonomy=, splitmethod=classify, taxlevel=4, cutoff=0.15)



I do appreciate any input,

All the best,
Lucas

Oi Rubens,

As I said, I am far from being a mothur expert. My opinion on your questions:

Question 1: Big file size for a distance matrix: using the head command from linux to print the first 100 lines of that 51.3 Gb distance matrix file, we can see that mothur still saved many distances >0.20 , is this a bug?
I think mothur writes a bit more than the cut-off. It says on the command page “By default the cutoff parameter is set to cutoff + (5 / (precision * 10.0)). So if you set the cutoff to 0.03 with precision of 100, the cutoff would be set to 0.035.”

I think you could maximize your computer usage writing a column-formatted distance matrix and/or work with a smaller distance: 0.05, maybe? Except if you have strong reasons to work with such high cut-off value.

Question 2: Is the distance matrix supposed to be so big like this, or I am doing something wrong? I think the distance file size is independent of read length, but it is very dependent of number of reads. Is 117267 sequences too much?
I do not think 117 k sequences are too much. See the question 1 reply for some ways to improve your computer usage.

Question 3: I didn’t tried to run the clustering in a better computer, but we can try it in a super cluster computer from our institute. Before that I just want to know if anyone here has experienced similar problems with >50 Gb distance matrices. Anyone?
Yes, I am having problems with a better computer configuration. See my previous post.

Question 4: Regarding the error message above, our computer has more than 4 Gb of RAM and is running a 64-bit system. We are using only 1 processor in the cluster command, because it doesn’t has any option to choose how many processors to use. It seems the error happened because there was not enough RAM memory to read the distance matrix, but interestingly if you add the used RAM and Swap memory the total size is less than the matrix (15.6 + 18 = 33.6 Gb). Where is the remaining 17.7 Gb of the matrix? Did mothur crashed because it couldn’t read it all?
Try cluster.split command. It is not working as I wanted for me, but for you might be a good try. See also the option for splitting based on taxonomy :

@ http://www.mothur.org/wiki/MiSeq_SOP
“The alternative is to use our cluster.split command. In this approach, we use the taxonomic information to split the sequences into bins and then cluster within each bin. We’ve published results showing that if you split at the level of Order or Family, and cluster to a 0.03 cutoff, you’ll get just as good of clustering as you would with the “traditional” approach. The advantage of the cluster.split approach is that it should be faster, use less memory, and can be run on multiple processors. In an ideal world we would prefer the traditional route because “Trad is rad”, but we also think that kind of humor is funny… In this command we use taxlevel=4, which corresponds to the level of Order.”

I hope I have helped,
Abraço,
Lucas

  • After trimming (qual file used, barcode mismatch =1, primer mismatch=2, only Phred score > 20 accepted> , minimum length 460 and maximum length 570), and quimera checking (Chimera Slayer algorithm), the total number of high quality sequences is 117,267 reads with average read length of 495 bp.

That’s likely the source of your problems. If you are going to use the quality scores, then you need to set it to qwindowaverage=35, qwindowsize=50. Otherwise, you just aren’t denoising your data at all. Essentially every one of your sequences likely has an error in it making most of them unique. This then explodes on you when you get to your distance matrix. Please try following the SOP by either using the q setting I described above or use the preferred trim.flows/shhh.flows approach.

Hi Pat,

Thanks for the input. In my case, I have illumina reads, therefore I have so much unique seqs. Would you suggest a different configuration of cluster.split rather than the one I am currently using:
cluster.split(column=myseqs.trim.unique.good.filter.unique.pick.dist, count=myseqs.trim.unique.good.filter.uchime.pick.count_table, processors=30)?
Thank you for the support,
Lucas

Lucas - how are you running make.contigs?

Hi Pat,

Thank you so much for the attention to my problem. I am working with EMP data.
Actually, I did not use make.contigs().
Here is the history of my commands:


fastq.info(fastq=myseqs.fastq, format=illumina1.8+) trim.seqs(fasta=myseqs.fasta, qfile=myseqs.qual,qwindowaverage=30,qwindowsize=5, maxambig=0, maxhomop=8,minlength=100, processors=30) unique.seqs(fasta=myseqs.trim.fasta) #The perl escript is a custom script given by our collaborators to obtain the file .group perl esprit_tree_pipeline_v3.pl -fasta_file myseqs.trim.fasta -samples_list ids.csv # count.seqs(name=myseqs.trim.names, group=myseqs.trim.groups, processors=30)

summary.seqs(count=myseqs.trim.count_table, fasta=myseqs.trim.unique.fasta, processors=30)

Start End NBases Ambigs Polymer NumSeqs

#Minimum: 1 100 100 0 1 1
#2.5%-tile: 1 100 100 0 3 514083
#25%-tile: 1 100 100 0 4 5140830
#Median: 1 100 100 0 4 10281659
#75%-tile: 1 100 100 0 4 15422488
#97.5%-tile: 1 100 100 0 6 20049235
#Maximum: 1 100 100 0 8 20563317
#Mean: 1 100 100 0 4.07256

of unique seqs: 1601761

#total # of seqs: 20563317

pcr.seqs(fasta=silva.bacteria.fasta, start=11894, end=25319, keepdots=F, processors=30)

system(mv silva.bacteria.pcr.fasta silva.bacteria.V4.fasta)
align.seqs(fasta=myseqs.trim.unique.fasta, reference=./silva.bacteria/silva.bacteria.V4.fasta, processors=30, flip=t)
screen.seqs(fasta=myseqs.trim.unique.align, count=myseqs.trim.count_table, start=1968, end=4411, processors=30)
filter.seqs(vertical=T, trump=., processors=30)
unique.seqs(fasta=myseqs.trim.unique.good.filter.fasta, count=myseqs.trim.good.count_table)
chimera.uchime(fasta=myseqs.trim.unique.good.filter.unique.fasta, count=myseqs.trim.unique.good.filter.count_table, dereplicate=t, processors=31)
remove.seqs(fasta=myseqs.trim.unique.good.filter.unique.fasta, accnos=myseqs.trim.unique.good.filter.unique.uchime.accnos)

summary.seqs(fasta=myseqs.trim.unique.good.filter.unique.pick.fasta,count=myseqs.trim.unique.good.filter.count_table, processors=30)

Start End NBases Ambigs Polymer NumSeqs

Minimum: 1 196 89 0 3 1

2.5%-tile: 1 200 98 0 3 504643

25%-tile: 1 200 99 0 4 5046430

Median: 1 200 99 0 4 10092859

75%-tile: 1 200 99 0 4 15139288

97.5%-tile: 1 200 100 0 6 19681074

Maximum: 1 200 100 0 8 20185716

Mean: 1 200 98.9813 0 4.06441

# of unique seqs: 1344890

total # of seqs: 20185716

pcr.seqs(fasta=./silva.bacteria/silva.bacteria.fasta, start=11894, end=25319, taxonomy=./silva.bacteria/silva.bacteria.silva.tax, keepprimer=T, nomatch=keep, processors=30)
dist.seqs(fasta=myseqs.trim.unique.good.filter.unique.pick.fasta, cutoff=0.05, processors=30)
cluster.split(column=myseqs.trim.unique.good.filter.unique.pick.dist, count=myseqs.trim.unique.good.filter.uchime.pick.count_table, processors=30)

Running for several days… at

Using 30 processors.
Using splitmethod distance.
Splitting the file…

All the best,
Lucas

Thank you so much for the attention to my problem. I am working with EMP data.

That’s the problem. The data quality coming off the HiSeqs with a single read is quite poor. Even with the trim.seqs settings you have, I doubt you’re doing much to trim the sequences. You essentially have the same problem as the original poster. High error rate, excessively high number of uniques -> large distance matrix. I suspect you’re limited to phylotyping the data.

Hi Pat,

Thanks for your opinion. I would like to go on on a bit more with my giant dist matrix. I am trying to predict how many more days will it be running in my machine. Would you say if the memory usage of my PC reaches the size of the matrix will the file finally be splitted to my processors!? Explaining: currently the memory usage is at 47.9 Gb, and my file has 63 Gb. Once the memory usage reaches 63 Gb, would that mean the the whole matrix is already splitted and the cluster could be calculated??
Thanks,
Lucas

You should probably try splitting the fasta file by taxonomy - then you’ll have smaller distance matrices and it should go faster.

Hi Pat,

Thanks for the suggestion. It ran much faster! Actually, it is already done!!

All the best,
Lucas

Dear Pat and Lucas,

Thank you for the reply and ideas. I am right now in a conference, in Russia (!) and I will try your suggestions as soon I get back to the lab, and then I´ll post the results here.

Dear Pat,

We successfully finished the analysis! Your advice using qwindowaverage=35 and qwindowsize=50 instead of a Phred score > 20 solved our problem. The final distance matrix, with unique sequences, was around 10 Gb and we clustered it in less than 12 hours. The problem was really a huge amount of different and unique sequences due to errors and noise in our dataset. Thanks for the advice!