Clustering issue

Dear all,

I am following the MiSeq SOP for my intestinal microbiome dataset.

I have 146,035 representative, classified sequences, which cluster into 70,411 OTUs when using the “cluster.split” command (.dist, .count_table, .taxonomy, taxlevel=4, cutoff=0.2, method=average) at 97% similarity level. Of these, 12,279 are classified (classify.otu) as a single genus of interest (let’s call it genus X). 9828 of these are singletons. I find the numbers a bit strange here, so I wanted to try another strategy:

From the 146,035 representative sequences I “pulled out” 14,499 sequences classified as genus X (by classify.seqs), used the respective .fasta sequences for dist.seqs, and then clustered them using the “cluster” (not “cluster.split”) command (.dist, .name, method=average). (The .name file I used here contained only one sequence name in the right column per line (the sequences only represent themselves)).

In this way, only the sequences already classified as genus X was used for clustering. I then got 152 OTUs at the 97% similarity level. (997 at 98%, 9538 at 99% etc.). This makes much more sense to me. I can’t really believe there exists 12,279 OTUs for genus X, 152 sounds much more realistic. Nevertheless, I do not understand why this is. Does anyone have a good explanation or at least some thoughts on how this can be?

All comments are welcome! :slight_smile:


PS: Additional fact (don’t know if it’s important): One of my OTUs contain half of the total amount of sequences (6 mill). This OTU belongs to the same order as genus X. I split at the Order level in cluster.split, so it may be relevant…

I suspect that the problem is that you are clustering with a names file that doesn’t accurately represent the redundancy in the data. The average neighbor algorithm includes that information in making the clusters. So removing it would probably make your sequence clusters seem more similar than they really are.