Species-level Identification via Classify.Seqs KNN

I’m trying to create a pipeline in mothur for classifying full-length 16S sequences down to the species level. Although they are considerably slower, I’ve found the various KNN implementations to be noticeably more accurate with my data at the genus level, and the full SILVA databases to be more accurate than the provided subsets, so I started from there.

I’m running the following pipeline on my test dataset of error-free sequences from 20 different genomes:

align.seqs(fasta=current, reference=silva.both.align, flip=t, processors=15)
chimera.uchime(fasta=current, reference=silva.gold.align, processors=15)
remove.seqs(fasta=current, name=current, accnos=current, group=Even_By_Gene_1500_1.groups, dups=T)
filter.seqs(fasta=current, vertical=T, trump=., processors=15)
unique.seqs(fasta=current, name=current)
pre.cluster(fasta=current, name=current, diffs=5)
classify.seqs(fasta=current, name=current, template=SSURef_111.fa, taxonomy=SSURef_111.species.tax, method=knn, numwanted=1, cutoff=80, processors=15)

But no matter what I do, I’m not getting any classification down to the species level. For example, my Lactobacillus gasseri sequence is classified (with k = 1, 3, or 5) as:

Lactobacillus_gasseri_1       Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;unclassified;unclassified;unclassified;unclassified;unclassified;unclassified;

According to BLAST, however, the top 5 hits were:

AB008209.1.1566 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
AB517146.1.1533 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
ACGO02000005.269.1829   Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
ACOZ01000018.102.1669   Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
ADDU01000006.362.1914   Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri

Any suggestions, or ideas for what I’m doing wrong here?



Searching the forums for related problems, I saw some suggestions to delete the *tree.sum file associated with the taxonomy. Here is what I get when I search that file with grep:

bbowman@mp-f027:~/1210_16S_Classification/uchime_1500_nr_species_knn1$ grep Lactobacill SSURef_111_NR.species.tree.sum
3       Bacilli 8       4-15    966     Bacillales      5       C178B   2908    IC-BH   7399    Lactobacillales 79      SHBZ1548        7116    VAN12   4194    unclassified    9658
4       Lactobacillales 15      16d63.751       7361    Aerococcaceae   3048    Carnobacteriaceae       836     Enterococcaceae 80      LL141-7D1       7292    Lactobacillaceae        391     Leuconostocaceae       416     Ll142-1A4       7290    Ll142-3M24      7291    MOB164  5800    P5D1-392        7256    PeH08   1702    Rs-D42  1293    Streptococcaceae        288     unclassified    9766
5       Lactobacillaceae        3       Lactobacillus   392     Pediococcus     784     unclassified    9801
6       Lactobacillus   1       unclassified    9802
bbowman@mp-f027:~/1210_16S_Classification/uchime_1500_nr_species_knn1$ grep gasseri SSURef_111_NR.species.tree.sum

I can find entries for my genus but not my species, which suggests that the species information isn’t making it out of the taxonomy file for some reason. Deleting the *tree.sum files and rerunning the protocol doesn’t resolve the issue. Curious…


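Update: Fixed the issue. I had forgotten to add semicolons to the end of my taxonomy lines, so the final (species) level was apparently being dropped. Each taxonomy string needs to end with a trailing semicolon, e.g. "...;Lactobacillus;gasseri;" rather than "...;Lactobacillus;gasseri".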
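
For anyone hitting the same thing, here is a minimal sketch of that fix in Python. It assumes the taxonomy file has one sequence per line in the form "name<TAB>level;level;...", as in the BLAST hits quoted above. The function names and the output file name are made up for illustration and are not part of mothur.

```python
def ensure_trailing_semicolon(line: str) -> str:
    """Return a taxonomy line (without its newline) that ends with ';'."""
    stripped = line.rstrip("\r\n")
    if not stripped.strip():
        return stripped  # leave blank lines alone
    stripped = stripped.rstrip()  # drop trailing spaces/tabs before checking
    return stripped if stripped.endswith(";") else stripped + ";"


def fix_taxonomy_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path, appending ';' where it is missing.

    Returns the number of lines that were modified.
    """
    changed = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fixed = ensure_trailing_semicolon(line)
            if fixed != line.rstrip("\r\n"):
                changed += 1
            dst.write(fixed + "\n")
    return changed
```

Something like fix_taxonomy_file("SSURef_111.species.tax", "SSURef_111.species.fixed.tax") would write a corrected copy (the output name is hypothetical). After that, it is probably worth deleting any stale *tree.sum file, as suggested elsewhere on the forums, and pointing classify.seqs at the corrected taxonomy file so the tree is rebuilt from it.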

Dear bbowman, I’m using the ‘silva.seed_v119.align’ file and am having the same issue you faced two years ago.

May I ask what exactly you did to fix the issue?

Thank you in advance!