I’m trying to create a pipeline in mothur for classifying full-length 16S sequences down to the species level. The various knn search methods are significantly slower, but I’ve found them to be more accurate with my data at the genus level, and the full SILVA databases to be more accurate than the provided subsets, so I started from there.
Here is my pipeline, run on my test dataset of error-free sequences from 20 different genomes:
unique.seqs(fasta=Even_By_Gene_1500_1_ref.fa)
align.seqs(fasta=current, reference=silva.both.align, flip=t, processors=15)
chimera.uchime(fasta=current, reference=silva.gold.align, processors=15)
remove.seqs(fasta=current, name=current, accnos=current, group=Even_By_Gene_1500_1.groups, dups=T)
filter.seqs(fasta=current, vertical=T, trump=., processors=15)
unique.seqs(fasta=current, name=current)
pre.cluster(fasta=current, name=current, diffs=5)
classify.seqs(fasta=current, name=current, template=SSURef_111.fa, taxonomy=SSURef_111.species.tax, method=knn, numwanted=1, cutoff=80, processors=15)
But no matter what I do, I’m not getting any classification down to the species level. For example, my Lactobacillus gasseri sequence is classified (with k = 1, 3, or 5) as:
Lactobacillus_gasseri_1 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;unclassified;unclassified;unclassified;unclassified;unclassified;unclassified;
According to BLAST, however, the top 5 hits were:
AB008209.1.1566 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
AB517146.1.1533 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
ACGO02000005.269.1829 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
ACOZ01000018.102.1669 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
ADDU01000006.362.1914 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri
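To make the mismatch concrete, the lineages above can be compared by the deepest rank that is actually classified. This is a minimal Python sketch, not part of mothur; the function name deepest_rank is hypothetical, and it assumes semicolon-delimited lineages padded with "unclassified", as in the output shown above.

```python
def deepest_rank(lineage):
    """Return (depth, name) of the last rank before the first 'unclassified'."""
    # Split on ';' and drop the empty field left by a trailing semicolon.
    ranks = [r for r in lineage.strip().split(";") if r]
    classified = []
    for rank in ranks:
        if rank == "unclassified":
            break
        classified.append(rank)
    if not classified:
        return (0, None)
    return (len(classified), classified[-1])


# Lineages copied from the classify.seqs output and the first BLAST hit above.
mothur_call = (
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;"
    "unclassified;unclassified;unclassified;unclassified;unclassified;unclassified;"
)
blast_hit = (
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;gasseri"
)

print(deepest_rank(mothur_call))  # (6, 'Lactobacillus')
print(deepest_rank(blast_hit))    # (7, 'gasseri')
```

In this example the mothur lineage stops at rank 6 (genus) and is padded out to 12 levels, while the BLAST hit carries a species name at rank 7.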
Any suggestions or ideas about what I’m doing wrong here?
-Brett