Error in classify.otu when using subsetted data

I have a problem with the classify.otu command. I started with a larger dataset, which I processed through several steps including classify.seqs and remove.lineage. After those steps, and before running dist.seqs, I decided to split the dataset into smaller subsets. The large dataset included several different sampling sets that do not need a common OTU table, and splitting reduced the computing time and memory needed for dist.seqs and cluster.

Now, when I run classify.otu with the taxonomy file (produced before subsetting) and the subsetted dataset, I get a huge list of sequences that “are not in my taxonomy file”.

Of course, I could go back, separate the sample sets at an earlier phase of the protocol, and spend a couple of days running all the steps again. However, a direct solution would be nice and, above all, I would like to understand why this did not work. As I understand it, the taxonomy file includes only information about sequences and taxonomy (i.e. no group information), so I don’t see why it doesn’t work with a subset of the original dataset.

Here are my commands, output file names and first lines from the output (instead of real group names I used “A1-A2-A3” because the real list is too long for the example):

classify.seqs(fasta=current, name=current, group=current, template=trainset14_032015.pds.fasta,, cutoff=80, processors=8)

Output File Names:

remove.lineage(fasta=current, name=current, group=current, taxonomy=current, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)

Output File Names:




get.groups(fasta=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.fasta, name=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.names, group=stability.good.good.pick.pick.groups, groups=A1-A2-A3)

system(cp stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta subset.fasta)

system(cp stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.pick.names subset.names)

system(cp stability.good.good.pick.pick.pick.groups subset.groups)

dist.seqs(fasta=subset.fasta, cutoff=0.15, processors=16)

Output File Names:


cluster(column=subset.dist, name=subset.names, method=nearest, cutoff=0.2)

Output File Names:




classify.otu(list=subset.nn.list, group=subset.groups,, name=subset.names, label=0.03)

0.03 18356
M03602_100_000000000-AWHUU_1_1101_24350_3388 is not in your taxonomy file. I will not include it in the consensus.
M03602_100_000000000-AWHUU_1_1101_26219_4656 is represented by M03602_100_000000000-AWHUU_1_1101_24350_3388 and is not in your taxonomy file. I will not include it in the consensus.

I also tried to subset the taxonomy file and use that but the result was the same:

get.groups(group=stability.good.good.pick.pick.groups, groups= A1-A2-A3,

classify.otu(list=subset.nn.list, group=subset.groups, taxonomy=current, name=subset.names, label=0.03)
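One way to narrow down a message like the one above is to list which representative names in the subsetted names file have no line in the taxonomy file. The following is only a minimal sketch, not part of mothur: it assumes the names file is tab-separated (representative name, then a comma-separated list of the names it stands for) and the taxonomy file is tab-separated (sequence name, then taxonomy string). The function names and the tiny example data are made up for illustration.

```python
def parse_names(names_text):
    """Map each representative name to the list of names it stands for."""
    reps = {}
    for line in names_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rep, members = line.split("\t")[:2]
        reps[rep] = members.split(",")
    return reps


def taxonomy_ids(taxonomy_text):
    """Collect the sequence names found in the first column of a taxonomy file."""
    return {
        line.split("\t")[0]
        for line in taxonomy_text.splitlines()
        if line.strip()
    }


def missing_representatives(names_text, taxonomy_text):
    """Return representatives from the names file that have no taxonomy line."""
    known = taxonomy_ids(taxonomy_text)
    return [rep for rep in parse_names(names_text) if rep not in known]


# Made-up data: seq3 is a representative, but only seq4 (one of the
# sequences it stands for) has a taxonomy line.
names = "seq1\tseq1,seq2\nseq3\tseq3,seq4\n"
taxonomy = (
    "seq1\tBacteria(100);Firmicutes(95);\n"
    "seq4\tBacteria(100);Proteobacteria(90);\n"
)
print(missing_representatives(names, taxonomy))
```

With real files, the two strings would be read from subset.names and the taxonomy file in use. If the missing representatives turn out to have other members of their name list present in the taxonomy file, that would suggest the names and taxonomy files went out of sync during subsetting, though this sketch alone cannot confirm the cause.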

You should be able to run this:

get.groups(fasta=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.fasta, name=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.names, group=stability.good.good.pick.pick.groups, groups=A1-A2-A3)

but add the taxonomy file.


Thank you for the answer!

First, I tried to generate a new taxonomy file for my subset.nn.list file.

get.groups(fasta=subset.fasta, name=subset.names, group=subset.groups,, groups=A1-A2-A3)

classify.otu(list=subset.nn.list, group=subset.groups, taxonomy=current, name=subset.names, label=0.03)

Again, there were some sequences that were not found in my taxonomy file.

Finally, I decided to go back, split the data before classify.seqs, and run all the later steps again. Now it worked, although I still don’t fully understand why… But what really matters is that it worked :slight_smile:

Great. FWIW, I’m not sure of an application where I would recommend using nearest neighbor.


May I ask why? And what would you recommend in general?

You should consult these papers… I would strongly encourage using OptiClust, which is now the default.

1: Westcott SL, Schloss PD. OptiClust, an Improved Method for Assigning Amplicon-Based Sequence Data to Operational Taxonomic Units. mSphere. 2017 Mar 8;2(2). pii: e00073-17. doi: 10.1128/mSphereDirect.00073-17. eCollection 2017 Mar-Apr. PubMed PMID: 28289728; PubMed Central PMCID: PMC5343174.

2: Schloss PD. Application of a Database-Independent Approach To Assess the Quality of Operational Taxonomic Unit Picking Methods. mSystems. 2016 Apr 26;1(2). pii: e00027-16. eCollection 2016 Mar-Apr. PubMed PMID: 27832214; PubMed Central PMCID: PMC5069744.
3: Westcott SL, Schloss PD. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ. 2015 Dec 8;3:e1487. doi: 10.7717/peerj.1487. eCollection 2015. PubMed PMID: 26664811; PubMed Central PMCID: PMC4675110.
4: Schloss PD, Westcott SL. Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol. 2011 May;77(10):3219-26. doi: 10.1128/AEM.02810-10. Epub 2011 Mar 18. PubMed PMID: 21421784; PubMed Central PMCID: PMC3126452.
5: Schloss PD, Handelsman J. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol. 2005 Mar;71(3):1501-6. PubMed PMID: 15746353; PubMed Central PMCID: PMC1065144.

Thank you! I will consider using OptiClust from now on :slight_smile: