Query about long run time of cluster.seqs command for 16S rRNA sequence reads

I am having trouble clustering my sequences. I am using about 203,000 forward-read sequences in FASTA files for the analysis. However, after aligning against the greengenes reference alignment and then running the distance command, the resulting sample.dist file came to 51 GB. The clustering step has now been running for 36 hours and has still not produced results.

Hi,

First - I strongly recommend against using greengenes as a reference alignment because it does a horrible job in the variable regions. Use the silva reference alignment as described in the MiSeq SOP.

Second - I’m not sure what steps you are running or how you’re running them. Could you post more of your pipeline? What region are you sequencing and with what chemistry? I suspect that you might need to consult this blog post…

Thanks,
Pat