Dear Pat Schloss and mothur community,
I’ve been using mothur since mid-2009 in several different contexts (454, clone libraries, functional genes, etc.), and it has always worked fine. However, our recent project on marine sediments produced a batch of new 454 sequences that we are having trouble processing: in short, the phylip-formatted distance matrix is 51.3 GB, our lab computer has only 16 GB of RAM, and consequently the cluster command cannot compute OTUs. I am still not sure whether this is a mothur problem, a hardware problem, or both (my guess is hardware).
In detail:
- Computer: Dell XPS 8700, Intel i7-3770 CPU @ 3.40 GHz x 8, 16 GB RAM, Ubuntu 12.04 (Precise) 64-bit.
- mothur version: v.1.30.2 (4/19/2013)
- 454 GS FLX pyrosequencing was used to sequence the 16S rRNA gene (V4–V6 region) in a multiplexed run with several libraries. Fifteen of those libraries belong to this marine-sediment project: 5 sampling points with 3 replicates each (5 samples x 3 replicates = 15 libraries).
- After trimming (qual file used, barcode mismatches = 1, primer mismatches = 2, only Phred scores > 20 accepted, minimum length 460, maximum length 570) and chimera checking (ChimeraSlayer algorithm), we are left with 117,267 high-quality reads with an average read length of 495 bp.
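For reference, the trimming and chimera steps looked roughly like this (file names are illustrative, and the exact quality option may have been slightly different, e.g. qaverage vs. a window-based threshold):
trim.seqs(fasta=isobatas.fasta, oligos=isobatas.oligos, qfile=isobatas.qual, qaverage=20, bdiffs=1, pdiffs=2, minlength=460, maxlength=570)
chimera.slayer(fasta=isobatas.trim.fasta, reference=silva.gold.align)
remove.seqs(accnos=isobatas.trim.slayer.accnos, fasta=isobatas.trim.fasta)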
- Alignment was done in mothur against the Silva reference database and filtered for vertical gaps with the filter.seqs command. The aligned file is 200 MB.
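In mothur terms, the alignment and filtering were essentially the following (file names again illustrative; the reference is the Silva alignment distributed on the mothur wiki):
align.seqs(fasta=isobatas.trim.pick.fasta, reference=silva.bacteria.fasta, processors=8)
filter.seqs(fasta=isobatas.trim.pick.align, vertical=T)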
- The distance matrix generated is a phylip-formatted table of 51.3 GB, built with cutoff=0.20.
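It was generated with something like the following, where output=lt requests the phylip (lower-triangle) format and the input file name is illustrative:
dist.seqs(fasta=isobatas.filter.fasta, cutoff=0.20, output=lt, processors=8)
which produced isobatas.filter.phylip.dist.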
- Clustering: OTUs were clustered with cluster(phylip=isobatas.filter.phylip.dist), with and without cutoff=0.20, and in both cases mothur crashed after running for several days. We always left a System Monitor window open so we could check whether the system was still running. mothur starts by reading the distance matrix, which in our case used 99% of the RAM (15.6 GB) plus 18 GB of swap, with a single CPU at 100%. After about 24 hours the results for the unique cutoff are printed to the terminal. After 4–5 days the CPU usage drops to 0–1% while the memory usage stays the same; at that point we either gave up and stopped mothur, or we got the following message:
[many zeroes above, which are results from the unique cutoff]
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
[ERROR]: std::bad_alloc has occurred in the ClusterClassic class function getSmallCell. This error indicates your computer is running out of memory. This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are running our 32bit version, your memory usage is limited to 4G. If you have more than 4G of RAM and are running a 64bit OS, using our 64bit version may resolve your issue. If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, http://www.mothur.org/wiki/Schloss_SOP. If you are unable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.
Question 1: About the large distance matrix file: using the Linux head command to print the first 100 lines of that 51.3 GB distance matrix, we can see that mothur still saved many distances > 0.20. Is this a bug?
Command example: head -n100 <distance_matrix_file.dist>
Question 2: Is the distance matrix supposed to be this big, or am I doing something wrong? I think the distance file size is independent of the read length but strongly dependent on the number of reads. Are 117,267 sequences too many?
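As a rough sanity check on that reasoning (assuming the phylip lower triangle writes every one of the N(N-1)/2 pairwise distances as plain text, at roughly 7–8 characters per value including separators):
117,267 x 117,266 / 2 ≈ 6.9 x 10^9 distances
6.9 x 10^9 x ~7.5 bytes ≈ 51 GB
which is about the file size we are seeing, so the size really does seem to be driven by the number of reads rather than their length.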
Question 3: I haven’t tried running the clustering on a better computer yet, but we could try it on a large compute cluster at our institute. Before that, I just want to know whether anyone here has experienced similar problems with >50 GB distance matrices. Anyone?
Question 4: Regarding the error message above: our computer has more than 4 GB of RAM and runs a 64-bit system, and we are using only 1 processor in the cluster command, since it has no option to choose the number of processors. It seems the error happened because there was not enough RAM to read the distance matrix, but interestingly, if you add up the used RAM and swap, the total is smaller than the matrix (15.6 + 18 = 33.6 GB). Where are the remaining 17.7 GB of the matrix? Did mothur crash because it couldn’t read it all?
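One guess, and it is only a guess: the 51.3 GB file stores the distances as text, while mothur presumably holds them in RAM as binary floating-point values, which are more compact. Assuming 4-byte single-precision numbers:
6.9 x 10^9 distances x 4 bytes ≈ 27.5 GB
which is in the same ballpark as the 15.6 GB of RAM plus 18 GB of swap we saw in use. But we don’t know mothur’s internal representation, so we would appreciate any clarification.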
Has anyone here run into similar problems? I appreciate any comments.