I have a problem with the classify.otu command. I started with a larger dataset, which I processed through several steps, including classify.seqs and remove.lineage. After those steps, and before running dist.seqs, I decided to split the dataset into smaller subsets. I did this because the large dataset contained several different sampling sets that do not need a common OTU table, and because splitting reduced the computing time and memory needed for dist.seqs and cluster.
Now, when I run classify.otu with the taxonomy file (produced before subsetting) and the subsetted dataset, I get a huge list of sequences that “are not in my taxonomy file”.
Of course, I could go back, separate the sample sets at an earlier phase of the protocol, and spend a couple of days running all the steps again. However, a direct solution would be nice, and I would especially like to understand why this did not work. As I understand it, the taxonomy file contains only sequence names and their taxonomy (i.e. no group information), so I don’t see why it fails with a subset of the original dataset.
Here are my commands, the output file names and the first lines of the output (instead of the real group names I used “A1-A2-A3”, because the real list is too long for the example):
classify.seqs(fasta=current, name=current, group=current, template=trainset14_032015.pds.fasta, taxonomy=trainset14_032015.pds.tax, cutoff=80, processors=8)
Output File Names:
stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy
stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary
stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.flip.accnos
remove.lineage(fasta=current, name=current, group=current, taxonomy=current, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)
Output File Names:
stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy
stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.names
stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.fasta
stability.good.good.pick.pick.groups
get.groups(fasta=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.fasta, name=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.names, group=stability.good.good.pick.pick.groups, groups=A1-A2-A3)
system(cp stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta subset.fasta)
system(cp stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.pick.names subset.names)
system(cp stability.good.good.pick.pick.pick.groups subset.groups)
dist.seqs(fasta=subset.fasta, cutoff=0.15, processors=16)
Output File Names:
subset.dist
cluster(column=subset.dist, name=subset.names, method=nearest, cutoff=0.2)
Output File Names:
subset.nn.sabund
subset.nn.rabund
subset.nn.list
classify.otu(list=subset.nn.list, group=subset.groups, taxonomy=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, name=subset.names, label=0.03)
0.03 18356
M03602_100_000000000-AWHUU_1_1101_24350_3388 is not in your taxonomy file. I will not include it in the consensus.
M03602_100_000000000-AWHUU_1_1101_26219_4656 is represented by M03602_100_000000000-AWHUU_1_1101_24350_3388 and is not in your taxonomy file. I will not include it in the consensus.
M03602_100_0000
I also tried subsetting the taxonomy file and using that instead, but the result was the same:
get.groups(group=stability.good.good.pick.pick.groups, groups=A1-A2-A3, taxonomy=stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy)
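To narrow this down, a check along the following lines could list which representative names in the subsetted names file are missing from the taxonomy file. This is only a rough sketch, not part of mothur: it assumes both files are tab-separated with the sequence name in the first column, and the function names (taxonomy_names, missing_representatives, check_files) and the example sequence names are made up for illustration.

```python
def taxonomy_names(lines):
    """Return the set of sequence names (first tab-separated column)."""
    return {line.split("\t")[0].strip() for line in lines if line.strip()}


def missing_representatives(name_lines, tax_names):
    """List representatives (first column of a names file) absent from tax_names."""
    missing = []
    for line in name_lines:
        if not line.strip():
            continue  # skip blank lines
        representative = line.split("\t")[0].strip()
        if representative not in tax_names:
            missing.append(representative)
    return missing


def check_files(names_path, taxonomy_path):
    """Run the same comparison on two files on disk (hypothetical helper)."""
    with open(taxonomy_path) as tax_file:
        tax_names = taxonomy_names(tax_file)
    with open(names_path) as names_file:
        return missing_representatives(names_file, tax_names)


# Made-up data for illustration, not taken from my real files.
example_taxonomy = [
    "seqA\tBacteria(100);Firmicutes(98);\n",
    "seqC\tBacteria(100);Proteobacteria(91);\n",
]
example_names = [
    "seqB\tseqB,seqD\n",  # representative seqB has no taxonomy entry
    "seqC\tseqC\n",
]
example_missing = missing_representatives(
    example_names, taxonomy_names(example_taxonomy)
)
print(example_missing)
```

On the real data this would be called as check_files("subset.names", "stability.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy"). If the reported names turn out to be representatives that only exist in subset.names, that would at least show where the two files disagree.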