Dear all,
I am following the MiSeq SOP for my intestinal microbiome dataset.
I have 146,035 representative, classified sequences, which cluster into 70,411 OTUs when using the “cluster.split” command (.dist, .count_table, .taxonomy, taxlevel=4, cutoff=0.2, method=average) at 97% similarity level. Of these, 12,279 are classified (classify.otu) as a single genus of interest (let’s call it genus X). 9828 of these are singletons. I find the numbers a bit strange here, so I wanted to try another strategy:
From the 146,035 representative sequences I “pulled out” 14,499 sequences classified as genus X (by classify.seqs), used the respective .fasta sequences for dist.seqs, and then clustered them using the “cluster” (not “cluster.split”) command (.dist, .name, method=average). (The .name file I used here contained only one sequence name in the right column per line (the sequences only represent themselves)).
In this way, only the sequences already classified as genus X was used for clustering. I then got 152 OTUs at the 97% similarity level. (997 at 98%, 9538 at 99% etc.). This makes much more sense to me. I can’t really believe there exists 12,279 OTUs for genus X, 152 sounds much more realistic. Nevertheless, I do not understand why this is. Does anyone have a good explanation or at least some thoughts on how this can be?
All comments are welcome!
Even
PS: Additional fact (don’t know if it’s important): One of my OTUs contain half of the total amount of sequences (6 mill). This OTU belongs to the same order as genus X. I split at the Order level in cluster.split, so it may be relevant…