cluster(method=average)

I have some problems with the cluster command. I have generated a dist. file with a size of 19 GB (1.6 million reads from 21 samples, reduced to 80,000 reads). It's no problem to load the file into RAM using the read.dist command (I'm working on a cluster with 24 GB of RAM and 8 processors). However, with the cluster command I got the following output after 10 minutes, but nothing has happened since; it is now in its 10th hour. Is that normal for such a big file, or is the file simply too big for this cluster? Is there a way to overcome this problem?

unique 31722 67497 6722 2019 972 592 396 262 200 166 123 108 80 60 62 64 48 52 41 29 28 29 219 13 19 11 12 16 8 11 8 9 15 4 10 5 5 10 7 6 3 8 3 8 6 71

0.00 31735 66769 6769 2006 974 588 396 263 194 169 116 107 84 65 62 62 50 49 40 32 32 31 19 22 15 12 16 10 19 7 13 7 10 12 7 8 4 5 11 7 4 3 7 4 7 8 7 5 3 7 3 2 5 6 2 4 2 1

Anders

Anders,

That sounds about right, sorry. However, have you done all of the steps in the Costello example analyses - quality filtering, chimera checking, uniquing, trimming so the sequences overlap, pre-clustering, etc.? I'd be surprised if you really have 80,000 unique sequences.
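For reference, the kind of pipeline I mean looks roughly like the following. The file names and reference files are placeholders and parameter names can differ between mothur versions, so treat this as a sketch rather than something to paste in verbatim; the Costello example on the wiki has the full walkthrough:

trim.seqs(fasta=all.fasta, oligos=all.oligos, qfile=all.qual, qwindowaverage=35, qwindowsize=50, maxambig=0, maxhomop=8, flip=T)
unique.seqs(fasta=all.trim.fasta)
align.seqs(fasta=all.trim.unique.fasta, reference=core_set_aligned.fasta, processors=8)
screen.seqs(fasta=all.trim.unique.align, name=all.trim.names, optimize=end, criteria=95)
filter.seqs(fasta=all.trim.unique.good.align, vertical=T, trump=.)
pre.cluster(fasta=all.trim.unique.good.filter.fasta, name=all.trim.good.names, diffs=2)
chimera.slayer(fasta=all.trim.unique.good.filter.precluster.fasta, reference=silva.gold.align)
remove.seqs(accnos=all.trim.unique.good.filter.precluster.slayer.accnos, fasta=all.trim.unique.good.filter.precluster.fasta, name=all.trim.unique.good.filter.precluster.names)

Each of those steps knocks down the number of unique sequences, and that is what keeps the distance matrix and the clustering step manageable.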

Pat

Then I will have to move to a bigger cluster. Do you have any idea how many GB of RAM the cluster should have to handle such a big dist. file? It will be possible for me to use a cluster with 256 GB for the calculations. You are right about the preprocessing: I changed qwindowaverage from 35 to 30 to get a better understanding of the process and to see what the cluster could handle, as I will add more sequences to the analysis when the last samples are sequenced. Besides this, I have strictly followed the Costello example. For the real analysis I will change qwindowaverage back to 35.
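To be precise about what I changed, it is the quality window setting in trim.seqs, i.e. something like the following (file names are placeholders):

trim.seqs(fasta=all.fasta, oligos=all.oligos, qfile=all.qual, qwindowaverage=30, qwindowsize=50, maxambig=0, maxhomop=8, flip=T)

This is the same call as in the Costello example with qwindowaverage lowered from 35 to 30, which lets more (and noisier) reads through and so inflates the number of unique sequences downstream.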

To my knowledge, almost none of the published papers dealing with 16S 454 pyrosequencing reads follow your strict rules for processing the reads. Do you think that if your processing rules were applied to those data, it would have changed the results and conclusions of the many papers now published?

Anders,

Could you shoot me an email - I have something for you to try out.

I’d have to say that whether the quality trimming matters really depends on the question. I know that the error rate when we do this quality trimming drops from 0.71% to 0.06%.

Check out these plots…
[Attachment: Slide04.jpg]
The above is from a mock community and shows the variation in quality scores with sequencing error type. If you plot the quality scores that correspond to substitutions, it looks like a U with a peak at 10 and a peak at 40. I suspect the substitutions at 40 represent PCR errors and those at 10 are sequencing errors. Something like pre.cluster would remove the PCR errors.
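To put that last point in command form (placeholder file names; diffs is typically set at about 1 difference per 100 bp of read length), something like

pre.cluster(fasta=all.trim.unique.good.filter.fasta, name=all.trim.good.names, diffs=2)

merges rare sequences into more abundant sequences that are within a couple of bases of them, which soaks up most of those errors.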
[Attachment: Slide05.jpg]
The above shows how the error rate varies along the length of the sequence (note that the x-axis is reversed).

The take home message is that errors accumulate in the last 200 bp of the sequences. I’d be happy to hear anyone’s feedback on these results. I am in the process of writing this up and hope to have the manuscript submitted by the end of the month.
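If you want to act on that directly, one illustrative option (not necessarily how we handle it, since the window-based trimming in trim.seqs already truncates most reads; the file name and length are placeholders) is to hard-trim every read to a fixed length before aligning, e.g.:

chop.seqs(fasta=all.trim.fasta, numbases=250, keep=front)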

Thanks,
Pat

Thanks for the reply. My email is anders.jensen@microbiology.au.dk. Those are some very interesting results; I will have a closer look at them later. By the way, do you have any idea how much RAM the cluster should have to handle a dist. file of 19 GB?

Anders

My pyroseq dataset is getting bigger and bigger. My 4-processor/16 GB RAM server freezes when clustering with a dist. file of 4 GB. I am also interested to know how much RAM is required to handle 50-80,000 final seqs after cleaning up with screen.seqs, chimera.slayer, filter.seqs, unique.seqs, etc.

John

I think it's safe to expect the required RAM to be about 3 times the size of the distance matrix file, assuming it is in column format and was created using a cutoff.
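As a back-of-the-envelope example: a 19 GB column-format matrix would then need on the order of 3 x 19 GB, i.e. roughly 57 GB of RAM, so a 24 GB node is not going to cut it, while a 256 GB node would have plenty of headroom. By the same rule of thumb, John's 4 GB matrix wants around 12 GB, which is borderline on a 16 GB server once the OS and everything else running on it are counted.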

I have just finished clustering a 9 GB dist. file (column-formatted, with a cutoff of 0.10) on a cluster with 24 GB of RAM. It took almost 17 hours to complete. However, even though I set the cutoff to 0.10 in both dist.seqs and read.dist, the cutoff was changed to 0.037 in the cluster analysis. Can you tell me why the cutoff is changed? And what is the expected amount of RAM required for a phylip matrix, as I would very much like to run the phylogeny-based analyses?

Anders

This is because average neighbor averages large and small distances to get the new threshold. So if you are merging sequences from two OTUs there may be pairs of sequences between the two OTUs with distances larger than the threshold. In this case, we reset the threshold to something smaller so that you aren’t merging sequences that have distances above the threshold. So if you put in a cutoff of 0.10 it is likely that the cutoff will drop to 0.03ish. We use a cutoff of 0.20 and then use average neighbor to get OTUs at cutoffs between 0.00 and 0.10.
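A made-up example to illustrate: say OTU1 = {A, B} and OTU2 = {C}, with d(A,C) = 0.07 and d(B,C) = 0.13, and you asked for a cutoff of 0.10. The average distance between the two OTUs is 0.10, so average neighbor would merge them at that level, but the B/C pair is 0.13 apart, above the cutoff you asked for (and, with a column matrix built at a cutoff of 0.10, that distance isn't even in the file). Rather than report OTUs at 0.10 that contain pairs like that, mothur pulls the reported cutoff down to a level where every merge is safe, which is how a requested 0.10 can end up as 0.037.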

Pat