Clustering analysis for Ion Torrent data

Hello,

Some microbial samples (V3 region) were sequenced on Ion Torrent, yielding a large number of reads per sample (~400-700K). I did the initial preprocessing with fastx_toolkit (length > 100 and Q17) and uchime, then moved to mothur to denoise the reads at diff=2. After denoising I had around 70-125K reads, which I used to build the distance file (dist.seqs) and then ran hcluster at cutoff=0.03. hcluster is taking a huge amount of time (30 hours) and space to do the clustering, and some analyses are still running. I am using a cluster with 1 head node (48 GB RAM) and 3 compute nodes (16 processors, 32 GB RAM). For the samples where hcluster did finish, the cutoff value was changed to 0.011 and 0.01087 for two of them.
My questions are:

  • Does hcluster change the cutoff based on the cutoff parameter, or is the change independent of it?
  • Does it really take that long (30 hours) to cluster about 100K reads after the distance calculation?
  • Which algorithm would you suggest for clustering reads from the V3 region?
  • Between hcluster and cluster, which is better in terms of the time needed to complete the analysis?

Thank you for your efforts and guidance in advance.

Warm regards,
DRAVID

That’s not going to go through - just way too much data caused by a high sequencing error rate (Q17 threshold is very low).

  • Does hcluster change the cutoff based on the cutoff parameter, or is the change independent of it?

See http://www.mothur.org/wiki/Frequently_asked_questions#Why_does_the_cutoff_change_when_I_cluster_with_average_neighbor.3F for information on why the threshold changes.

  • Does it really take that long (30 hours) to cluster about 100K reads after the distance calculation?

Probably more actually. I doubt you can cluster 100k unique sequences.
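To put a rough number on why 100k uniques is a problem: an all-against-all comparison of n sequences involves n(n-1)/2 pairwise distances. Here is a minimal sketch of that arithmetic. The 4 bytes per distance is an assumption for illustration only, not mothur's actual storage format (with a cutoff, only distances below it need to be kept), so treat the sizes as a worst-case ballpark.

```python
def pairwise_count(n_unique: int) -> int:
    """Number of distinct sequence pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


def dense_matrix_gib(n_unique: int, bytes_per_distance: int = 4) -> float:
    """Size in GiB if every pairwise distance were stored.

    bytes_per_distance=4 is an illustrative assumption, not a mothur value.
    """
    return pairwise_count(n_unique) * bytes_per_distance / 2**30


if __name__ == "__main__":
    # Read counts taken from the range mentioned in the question.
    for n in (70_000, 100_000, 125_000):
        print(f"{n:>7} uniques: {pairwise_count(n):,} pairs, "
              f"~{dense_matrix_gib(n):.1f} GiB if stored densely")
```

The pair count grows with the square of the number of uniques, so halving the uniques through better quality control cuts the work by roughly a factor of four.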

  • Which algorithm would you suggest for clustering reads from the V3 region?
  • Between hcluster and cluster, which is better in terms of the time needed to complete the analysis?

First, don’t use hcluster: it’s a disaster with average neighbor and actually takes longer to run than cluster. Use cluster or cluster.split instead. Second, go back to the beginning and use a much more stringent quality control filter.
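As a rough illustration of what a stricter filter means in practice, here is a minimal Python sketch that keeps a read only if it is long enough and its mean Phred score clears a threshold. The function names and the default thresholds (length 100, mean Q25) are hypothetical choices for the example, not recommended values and not what fastx_toolkit or mothur do internally. It assumes Phred+33 encoded quality strings.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def passes_qc(sequence: str, quality: str,
              min_length: int = 100, min_mean_q: float = 25.0) -> bool:
    """Return True if the read meets the length and mean-quality thresholds.

    min_length and min_mean_q are illustrative defaults, not recommendations.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    if not sequence or len(sequence) < min_length:
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_mean_q


if __name__ == "__main__":
    read = "ACGT" * 30            # 120 bases
    q17 = "2" * len(read)         # '2' is Phred 17 in Phred+33
    q40 = "I" * len(read)         # 'I' is Phred 40 in Phred+33
    print(passes_qc(read, q17))   # a uniformly Q17 read against a Q25 threshold
    print(passes_qc(read, q40))
```

A read that only just passed a Q17 filter is rejected here, which is the point: fewer erroneous reads survive, so fewer spurious uniques reach the distance and clustering steps.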

Is there any detailed description of the algorithm behind clustering?
What is the run-time complexity in terms of the number of sequences n, the number of clusters k, and the sequence length l? Is it O(n^2)? O(nk)?

I have a similar problem to the one in the previous post. After alignment and filtering I have ~300k unique sequences (V4 region), each ~150 positions long including gaps. It took 2 days to finish on our cluster. Another group of samples contains many more reads, and that run couldn’t finish in weeks.

Is it possible to add a greedy or parallel algorithm to the command to speed up clustering? We are thinking of developing our own clustering method, but it would be ideal if mothur could provide such a function.

Thanks!

It’s all pretty standard hierarchical clustering. The problem is that Ion Torrent data really suck and have a very high error rate. You have so many uniques because you have a high error rate. I suspect your time and money would be better spent either just using the phylotype approach or finding a MiSeq and using the primers and approach outlined in Kozich et al. (2013; AEM), where you will get full overlap of your reads, longer contigs, and a much reduced error rate. Ion Torrent initially sounded like a good idea, but as you’ll find looking around the forum, few people (if any) have gotten it to work for 16S rRNA gene sequencing.

Pat
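
For readers wondering what "standard hierarchical clustering" with average neighbor looks like, here is a deliberately naive Python sketch. It is not mothur's implementation. The function name and the dictionary-of-pairs input are assumptions made for the example. It repeatedly merges the two clusters with the smallest average pairwise distance and stops once that average exceeds the cutoff.

```python
from itertools import combinations


def average_neighbor(dist: dict, n: int, cutoff: float) -> list:
    """Naive average-linkage clustering of items 0..n-1.

    dist maps frozenset({i, j}) -> distance and must contain every pair.
    Merging stops when the closest pair of clusters is farther than cutoff.
    """
    clusters = [[i] for i in range(n)]

    def avg(a, b):
        # Mean of all member-to-member distances between two clusters.
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Examine every pair of current clusters and pick the closest one.
        (x, y), d = min(
            (((a, b), avg(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy data: items 0,1 are close, items 2,3 are close, everything else is far.
    d = {frozenset(p): 0.10 for p in combinations(range(4), 2)}
    d[frozenset((0, 1))] = 0.01
    d[frozenset((2, 3))] = 0.02
    print(average_neighbor(d, 4, cutoff=0.03))
```

Written this way, every merge step rescans all cluster pairs, so the cost grows roughly with the cube of the number of sequences. Real implementations are smarter about this, but they still have to work through the pairwise distances, which is why the number of uniques matters so much.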