I have a bunch of fungal ITS2 sequences and have been processing them following the SOP. I’m at the classification step and am using UNITE ITS2. When I use k=8, ~25% are unclassified (I BLASTed a few of the unclassified ones and they came back as perfect matches to sequences in UNITE). I tried k=7: fewer unclassified, but still more than 20%, and still sequences that BLAST says should have been classified. I tried k=6 and suddenly all are classified at least to phylum. Naturally, when something suddenly works, I worry about what I might be doing wrong. Any insight into why I’m getting such a difference between k=7 and k=6?
I suspect your phyla all have at least 6 sequences in them. Remember that this method works by taking the k closest matches and reporting the complete consensus taxonomy. So if k=7 and a phylum in your reference only has 6 sequences in it, that phylum will never get detected.
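To make the nearest-neighbour consensus idea concrete, here is a minimal Python sketch. It is only an illustration of the reasoning above, not mothur's implementation; the function name and the toy taxonomy strings are made up for the example.

```python
def consensus_taxonomy(taxonomies):
    """Return the ranks shared by every taxonomy string, from the top rank down."""
    if not taxonomies:
        return []
    # Split "Fungi;Ascomycota;" style strings into lists of ranks.
    split = [t.rstrip(";").split(";") for t in taxonomies]
    consensus = []
    for ranks in zip(*split):
        # Stop at the first rank where the neighbours disagree.
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return consensus


# Hypothetical reference: a phylum with only 6 sequences.
six_same = ["Fungi;Ascomycota"] * 6

# With 6 neighbours, all of them can come from that phylum.
neighbours_k6 = six_same
# With 7 neighbours, at least one must come from somewhere else.
neighbours_k7 = six_same + ["Fungi;Basidiomycota"]

print(consensus_taxonomy(neighbours_k6))  # ['Fungi', 'Ascomycota']
print(consensus_taxonomy(neighbours_k7))  # ['Fungi']
```

With a strict (complete) consensus, the seventh neighbour is enough to lose the phylum-level assignment.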
I was using the default method which I thought was Wang where k=kmer size, not the knn where k=#nearest neighbours. Is knn the default?
Argh, sorry. Not enough sleep… You’re correct.
The best kmer size will depend on the length of the sequences and the number of sequences in each taxon. Unfortunately, it can only really be determined empirically, by doing a leave-one-out test to see how classification accuracy depends on kmer size. The other consideration is that larger kmer sizes will take more time to do the classifications. In general, a kmer size of 7 or 8 seems to work well in the stuff we’re testing.
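A rough Python sketch of what such a leave-one-out test could look like is below. To keep it short, it swaps in a toy classifier that assigns the taxon of the reference sharing the most kmers with the query; this is not the Wang method and not mothur code. The function names and the four toy sequences are invented for illustration, and a real test would read the UNITE fasta and taxonomy files instead.

```python
def kmers(seq, k):
    """Return the set of all overlapping kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(query, references, k):
    """Toy classifier: taxon of the reference sharing the most kmers (None if none shared)."""
    query_kmers = kmers(query, k)
    best_taxon, best_shared = None, 0
    for seq, taxon in references:
        shared = len(query_kmers & kmers(seq, k))
        if shared > best_shared:
            best_taxon, best_shared = taxon, shared
    return best_taxon


def leave_one_out_accuracy(references, k):
    """Hold out each reference in turn, classify it against the rest, return fraction correct."""
    correct = 0
    for i, (seq, taxon) in enumerate(references):
        rest = references[:i] + references[i + 1:]
        if classify(seq, rest, k) == taxon:
            correct += 1
    return correct / len(references)


# Made-up toy reference set: (sequence, taxon) pairs.
toy_refs = [
    ("ACGTACGTACGTAAGG", "Ascomycota"),
    ("ACGTACGTACGTTAGG", "Ascomycota"),
    ("TTGCATTGCATTGCCA", "Basidiomycota"),
    ("TTGCATTGCATTGGCA", "Basidiomycota"),
]

# Compare accuracy across a few kmer sizes.
accuracy_by_k = {k: leave_one_out_accuracy(toy_refs, k) for k in (4, 6, 8)}
for k, acc in accuracy_by_k.items():
    print(f"k={k}: leave-one-out accuracy = {acc:.2f}")
```

Note that a taxon with a single reference sequence can never be recovered when that sequence is held out, so the number of sequences per taxon affects the result as much as the kmer size does.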
OK, thanks. I’ll add a leave-one-out test on the UNITE db to my to-do list.