classify.otu with normalised data

After processing my 16S rRNA pyrosequencing data (more or less following your SOP), I want to normalize the number of sequences in each sample. Both sub.sample and normalize.shared worked well with a shared file, which is very useful for multivariate analysis of the bacterial communities. However, I also want to compare community structure across the samples at different taxonomic levels. The classify.otu command (using the group option) provides the respective data, but I cannot use it with the normalised shared file. Is it possible to normalise the list file in the same way as the shared file, or is there another way to run classify.otu with a normalised data set?

As a workaround, I tried the commands phylotype + make.shared + sub.sample separately for each taxonomic level. However, when I compared the summary file from classify.otu with the shared file from make.shared, I noticed a difference in the number of OTUs and the number of sequences per OTU at the two lowest taxonomic levels (label = 1 or 2).

What do you think is the best way?

Thank you

I’m not sure why the classify.otu output would be affected by subsampling. The number in the parentheses for the output is the fraction of sequences in the OTU that have that classification. So it’s a relative abundance and we only accept those over 50% as being valid.
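To make the "fraction over 50%" rule concrete, here is a minimal Python sketch of the idea. It is not mothur's implementation; the function name `consensus_taxon` and the `cutoff` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter

def consensus_taxon(labels, cutoff=50.0):
    """Return (taxon, percent) for the most common label in one OTU.

    Illustration only: the taxon is accepted when the percentage of
    sequences carrying it is strictly greater than `cutoff`; otherwise
    the OTU is reported as "unclassified" at this level.
    """
    if not labels:
        raise ValueError("OTU has no sequences")
    taxon, count = Counter(labels).most_common(1)[0]
    percent = 100.0 * count / len(labels)
    if percent > cutoff:
        return taxon, percent
    return "unclassified", percent

# One OTU with 10 sequences, classified at genus level
otu = ["Bacteroides"] * 7 + ["Prevotella"] * 3
print(consensus_taxon(otu))  # ('Bacteroides', 70.0)
```

Because the value is a percentage of the sequences in the OTU, removing sequences by subsampling can shift it slightly, but it does not change how the number is interpreted.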

Sorry, my question is not about the way to get the consensus taxonomy for each otu.
I think that the summary file of classify.otu provides very helpful information (using the group file and basis=sequence or otu) about the community composition at different taxonomic levels. You can compare the number and the relative abundance of the taxa between different treatments (of course by using replicates).
If I subsample my data, this reduces the total number of OTUs, and it should also reduce the taxa present in the individual samples.
Therefore it would be great to be able to run it with the same number of sequences per sample.
Many thanks for your help

Sorry for the confusion. Here’s a workaround that you can try…

  1. Run sub.sample with the list file
  2. Run list.seqs on the resulting list file
  3. Use get.seqs with the output and the original taxonomy file with dups=T
  4. Use classify.otu with the newly subsampled taxonomy file and names file

We’ll work on allowing a taxonomy file to go into the sub.sample command for a future release.

Hope this helps,

Many thanks for your quick response.
I tried the sub.sample command with the list file. But this only produced a subset of the total sequences independent of the samples. Is it necessary to include a specific “group option” or something like that to normalize the number of sequences in each sample?

Sorry for this tedious workaround. The taxonomy file will be added as an option in sub.sample in our next release.

  1. sub.sample(, group=final.groups, persample=t) - selects same number of seqs from each group
  2. list.seqs(list=current) - lists the seqs in the subsample

Workaround to avoid issues with sequence names (this will be resolved in the next release):

  3. remove.seqs(, accnos=current) - creates list file with seqs not in subsample
  4. list.seqs(list=current) - lists seqs not in the subsample
  5. remove.seqs(taxonomy=yourTaxonomyFile, name=yourNameFile, dups=f, accnos=current) - removes all seqs not in the subsampled list file

  6., name=current)
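The point of the steps above is that one set of sequence names is chosen per sample and then reused to filter every other file. A rough Python sketch of that logic follows; it is not mothur code, and `subsample_per_group` and `filter_by_names` are hypothetical helper names.

```python
import random
from collections import defaultdict

def subsample_per_group(seq_to_group, size, seed=0):
    """Pick `size` sequence names at random from every group (sample).

    This mirrors the idea of a per-sample subsample: each group ends up
    with the same number of sequences, instead of `size` in total.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for seq, group in seq_to_group.items():
        by_group[group].append(seq)
    kept = set()
    for group, seqs in by_group.items():
        if len(seqs) < size:
            raise ValueError(f"group {group} has fewer than {size} sequences")
        kept.update(rng.sample(sorted(seqs), size))
    return kept

def filter_by_names(records, kept):
    """Keep only the entries whose sequence name is in the chosen subsample."""
    return {name: value for name, value in records.items() if name in kept}

# Toy data: sample A has 4 sequences, sample B has 2
groups = {"s1": "A", "s2": "A", "s3": "A", "s4": "A", "s5": "B", "s6": "B"}
taxonomy = {name: "Bacteria;Firmicutes;" for name in groups}

kept = subsample_per_group(groups, size=2)
sub_groups = filter_by_names(groups, kept)      # 2 sequences per sample
sub_taxonomy = filter_by_names(taxonomy, kept)  # same names as sub_groups
print(sorted(sub_groups) == sorted(sub_taxonomy))  # True
```

Filtering every file against the same name list is what keeps the list, taxonomy, and name files in agreement; subsampling each file independently would draw different sequences each time.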


Thank you so much, it works very well.
Best wishes,

I’m sorry, I’m new to mothur and I could really use some help. :oops:

I am analyzing 8 samples (16S clone libraries) and had the same question about running classify.otu on the subsampled dataset. I tried what is explained here, but I ended up with a classification of sequences instead of OTUs.
Following the SOP tutorial more or less, I subsampled all 8 samples to 96 sequences each (the lowest number). Running “sub.sample(, size=96)” gave me the file “”, which reduced the total number of OTUs from 525 to 512.
Then I ran the classify.otu command: classify.otu(, name=final.names, taxonomy=final.taxonomy, label=0.03)
This gave me the classification of the original 525 OTUs, not the 512 subsampled OTUs.
What should I do to run classify.otu on the subsampled shared file?

And one more thing: the .rabund files from each library are also based on the total number of sequences in each library, not on the final normalized number of sequences per library (sample). Is it possible to get .rabund files from the subsampled set of sequences as well?

Thank you!

You can subsample list, name, and taxonomy files with sub.sample. You are running classify.otu on non-subsampled inputs, so you wouldn’t get subsampled outputs.

Thank you for your answer! So, should I subsample the final list file and then make the shared file from the subsampled list file, to use as input for the rest of the commands from here on? I mean, how do I get the subsampled list file that corresponds to the subsampled shared file I need for the OTU analyses? I need the classification of the subsampled OTUs, plus a shared file for the alpha and beta diversity analyses that corresponds to the subsampled OTUs identified in each sample.
:frowning: Sorry and thanks

I don’t have your data, but it looks like you want to sub.sample at least your list, name, taxonomy, and group files.

You can then run make.shared on the newly subsampled list and group files to get a shared file containing only subsampled sequences for use in commands that require a shared file.
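As a rough picture of what building a shared table from list and group information amounts to, here is a small Python sketch. It is only an illustration of the counting step, not mothur's make.shared; `build_shared` is a hypothetical name, and sequences missing from the group mapping are assumed to have been removed by subsampling.

```python
from collections import Counter, defaultdict

def build_shared(otus, seq_to_group):
    """Count sequences per OTU per group.

    otus:         dict mapping an OTU label to its list of sequence names
    seq_to_group: dict mapping a sequence name to its group (sample)

    Sequences absent from seq_to_group are skipped, so an OTU whose
    sequences were all removed does not appear in the result.
    """
    table = defaultdict(Counter)
    for otu, seqs in otus.items():
        for seq in seqs:
            group = seq_to_group.get(seq)
            if group is not None:
                table[group][otu] += 1
    return {group: dict(counts) for group, counts in table.items()}

otus = {"Otu1": ["s1", "s2", "s3"], "Otu2": ["s4"], "Otu3": ["s5", "s6"]}
# s4 is not listed: assume it was discarded during subsampling
groups = {"s1": "A", "s2": "A", "s3": "B", "s5": "B", "s6": "A"}

shared = build_shared(otus, groups)
print(shared["A"])  # {'Otu1': 2, 'Otu3': 1}
print(shared["B"])  # {'Otu1': 1, 'Otu3': 1}
```

In this toy case Otu2 disappears from the table because its only sequence was not kept, which is the same effect as the total OTU count dropping after subsampling.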

Thank you! I was afraid that if I subsampled all the files separately, the sequences chosen to be discarded would not be the same, and that I would therefore have to follow a specific sequence of steps to be sure that all the subsampled files contained the same set of sequences with their corresponding names and classifications. I’ll try what you suggest with my final. files :oops:

When I asked my last question in this thread, I gave up on subsampling the list, name and group files. Now I am trying again with a new dataset. If I subsample the list, names, taxonomy and group files, I do not get groups containing the same number of seqs (the lowest number in a sample/group); instead, I get that number of seqs subsampled in total. As expected, when I run count.groups I get a different number of seqs per sample, and together they add up to the number of seqs that each sample should have (as set with the size parameter in sub.sample).

I tried the workaround explained above but for some reason I could not get what I want.

Is there now a way to run classify.otu with a subsampled shared file, giving as output the classification of only the subsampled OTUs?

Thank you!

Please just subsample the shared file as we do in the SOP.


Pat, I did it. But then, when I want to know the affiliation of the OTUs I am working with, I have to go through them one by one to find which ones are still in my taxonomy file but no longer in my OTU table, because they were thrown away during subsampling.
Is there a way to run classify.otu using a shared file? Or to get list and name files that match the subsampled shared file? That way, when I show the affiliation of the OTUs, I would only show those that were taken into consideration for the alpha and beta diversity analyses, and not all the OTUs that were built at first, before subsampling.
Maybe I am not understanding this properly, but I found that when I plot the classification of the OTUs to see the taxonomic composition of my samples, I get all of them and not only those used for the analysis.

If you run classify.otu on your list file, then make.shared on your list file, and finally subsample the shared file, the OTU numbers will be consistent between the output from classify.otu and the subsampled shared file. This is what we do in the SOPs.

When I run classify.otu on my list file, I have 2152 OTUs classified. When I run make.shared on my list file, I have 2152 OTUs. But when I then subsample the shared file, I end up with 2147 OTUs. Even though 5 OTUs are not a lot, I cannot find a way to know which OTUs those are, so that I can remove them from the taxonomy file and show the classification of the OTUs as a stacked barplot (I have to show that for all samples, at genus and phylum levels) that corresponds to the OTUs in the subsampled shared file used as input for the heatmap and ordination. Anyway, in this particular case, I think that 5 OTUs won’t make a difference, as the 5 OTUs that disappeared with subsampling were likely “rare” OTUs with low abundance in the samples. Am I right?

The OTU names in the shared file and the cons.taxonomy file are the same. It’s not an issue.
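Since the OTU names match, the OTUs that dropped out can be found by comparing labels. The Python sketch below is one way to do that outside mothur. It assumes tab-delimited files in which the shared file header starts with label, Group, numOtus followed by the OTU names, and the cons.taxonomy file has the columns OTU, Size, Taxonomy; the helper names and the toy data are hypothetical.

```python
def otu_labels(shared_text):
    """Return the OTU column names from the header line of a shared file."""
    header = shared_text.splitlines()[0].split("\t")
    return header[3:]  # skip the label, Group and numOtus columns

def filter_cons_taxonomy(cons_text, keep):
    """Keep the header line plus the rows whose OTU name is in `keep`."""
    lines = cons_text.splitlines()
    rows = [ln for ln in lines[1:] if ln.split("\t")[0] in keep]
    return "\n".join([lines[0]] + rows)

# Toy stand-ins for the real files (read the real ones with open(...).read())
shared_before = (
    "label\tGroup\tnumOtus\tOtu1\tOtu2\tOtu3\n"
    "0.03\tA\t3\t10\t4\t1\n"
    "0.03\tB\t3\t8\t6\t0\n"
)
shared_after = (
    "label\tGroup\tnumOtus\tOtu1\tOtu2\n"
    "0.03\tA\t2\t5\t3\n"
    "0.03\tB\t2\t4\t4\n"
)
cons_taxonomy = (
    "OTU\tSize\tTaxonomy\n"
    "Otu1\t18\tBacteria(100);Firmicutes(100);\n"
    "Otu2\t10\tBacteria(100);Bacteroidetes(100);\n"
    "Otu3\t1\tBacteria(100);unclassified(100);\n"
)

kept = set(otu_labels(shared_after))
dropped = [otu for otu in otu_labels(shared_before) if otu not in kept]
print(dropped)  # ['Otu3']
print(filter_cons_taxonomy(cons_taxonomy, kept))
```

Note that the Size column in the filtered taxonomy table still reflects the counts before subsampling, so abundances for a barplot should be taken from the subsampled shared file.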