preclustering

Hi,
I have a question regarding the pre.cluster command. I don't understand why we use this command after the filtering step. The wiki page says that it removes sequences that are due to sequencing errors; if that is the case, shouldn't it be used before we align the sequences?
I have a set of 25000 sequences; after using the unique command I am left with ~7000.
When I use the following commands without pre-clustering, everything runs fine.
unique
align
filter
dist
cluster
classify
But if I use all 25000 sequences, it stops at the clustering step: it takes forever and then the command is killed.
I don't understand why this is happening.

My second question is why we align everything and then make the distance matrix. Isn't the distance matrix based on the distances between our own sequences? What is the relation to the reference sequences here?

Thanks. I know it's too much for one post, but I am really confused here.

I would really encourage you to read our earlier PLoS ONE paper on 454 sequence curation as well as the more recent MiSeq paper online early at AEM. You could also read the 2012 ISME J paper. The mothur documentation and SOPs really assume that you’ve read and understood the underlying papers we’ve published over the past few years.

I have a question regarding the pre.cluster command. I don't understand why we use this command after the filtering step. The wiki page says that it removes sequences that are due to sequencing errors; if that is the case, shouldn't it be used before we align the sequences?

pre.cluster denoises the sequences based on the alignment, so the sequences have to be aligned first. If you don’t run it, you’ll artificially inflate the number of sequences and OTUs. The effect of inflating the number of sequences with sequencing errors is that you will suck up tons of RAM in cluster and your computer will crash, which is what you’re seeing.
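
To make the idea concrete, here is a minimal Python sketch of alignment-based denoising. It is a toy illustration under stated assumptions, not mothur's actual pre.cluster implementation: the names `count_diffs`, `toy_precluster` and `max_diffs` are hypothetical, and the sequences are made up. Rare sequences are folded into a more abundant sequence when they differ from it in at most `max_diffs` aligned columns, which only makes sense once every sequence sits in the same alignment coordinates. The last lines show why fewer sequences matter for the clustering step: the number of pairwise distances grows roughly with the square of the number of sequences.

```python
from math import comb


def count_diffs(a, b):
    """Count mismatched columns between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def toy_precluster(counts, max_diffs=1):
    """Greedily fold rare sequences into more abundant ones.

    counts maps an aligned sequence to its abundance.
    Hypothetical helper for illustration only.
    """
    # Visit the most abundant sequences first; break ties by sequence text.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    merged = {}
    for seq, n in ordered:
        for rep in merged:
            if count_diffs(seq, rep) <= max_diffs:
                merged[rep] += n  # treat seq as an error variant of rep
                break
        else:
            merged[seq] = n  # no close representative, keep it as its own
    return merged


# Made-up aligned sequences with their abundances.
aligned = {
    "ACGT-ACGTA": 90,
    "ACGT-ACGTT": 3,
    "ACGT-ACCTA": 2,
    "TTGT-ACGAA": 40,
    "TTGT-ACGAC": 1,
}
result = toy_precluster(aligned, max_diffs=1)
print(result)  # {'ACGT-ACGTA': 95, 'TTGT-ACGAA': 41}

# Number of pairwise distances needed for n sequences is n * (n - 1) / 2.
pairs_all = comb(25000, 2)
pairs_unique = comb(7000, 2)
print(pairs_all, pairs_unique)  # 312487500 24496500
```

The total abundance is preserved (136 reads before and after); only the number of distinct sequences that go into the distance and clustering steps shrinks.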

My second question is why we align everything and then make the distance matrix. Isn't the distance matrix based on the distances between our own sequences? What is the relation to the reference sequences here?

The distance matrix is based on the distance between your sequences. The reference sequences are only used to create a common alignment across all of your sequences.
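
As a rough illustration of that point, the sketch below computes pairwise distances directly between aligned query sequences; no reference sequence appears in the calculation. It assumes a simplified distance (fraction of mismatched columns, skipping columns where both sequences have a gap), which is not necessarily how dist.seqs treats gaps. The function name `toy_distance` and the sequences are hypothetical.

```python
from itertools import combinations


def toy_distance(a, b):
    """Fraction of mismatched columns, skipping columns where both have a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


# Made-up sequences that already share one alignment (same length, same columns).
seqs = {
    "seqA": "ACGT-ACGTA",
    "seqB": "ACGT-ACGTT",
    "seqC": "AC-TTACGTA",
}

# Distances are computed only between the query sequences themselves.
matrix = {
    (i, j): toy_distance(seqs[i], seqs[j])
    for i, j in combinations(sorted(seqs), 2)
}
for pair, d in matrix.items():
    print(pair, round(d, 3))
```

Comparing column by column is only meaningful because every sequence was placed into the same coordinates first, and that is the one job the reference alignment does.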

Pat