Differential representation when the OTU of interest is hidden in 0s

Dear all

I have recently started a new project where I am trying to find mutually exclusive OTUs in 2 groups of samples.
The 2 groups are subjects infected with an enterohemorrhagic pathogen (24) and subjects that are not (28). What I want to find is bacterial OTUs that are mutually exclusive with the enterohemorrhagic pathogen. Theoretically, I should be able to see that with a DESeq analysis or with random forest; however, I am not because, like the title says, my data is buried behind 0s.
More specifically, when making a heatmap I can see that OTU828 could be of interest: in infected samples it is present only once, with <10 reads, whereas in uninfected samples it is present in 4 samples at higher levels and can reach >2000 reads. However, in most samples (47/52) it is absent.
YES, I understand that such a rare OTU may not seem very interesting, BUT in the context of my biological question, “searching for OTUs that are (almost) never present in infected samples but are present in uninfected samples (even if only sometimes)”, it does make sense.

Examining the dataset with DESeq and random forest did not single out the OTU of interest; however, if I examine the OTU with a zero-inflated binomial model I do get a very high significance.
To do that in a more controlled way, I am doing the analysis in 2 steps: 1. keep only OTUs whose total reads across the infected samples is <100 (that number can change depending on the dataset) and 2. fit a zeroinfl model (pscl package in R) to each OTU, then correct the p-values for the number of OTUs tested.
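
A minimal sketch of the filter-then-correct bookkeeping in the two steps above, written in plain Python so it is self-contained. The sample names, read counts and p-values are made up for illustration, the p-values stand in for whatever the per-OTU zero-inflated fit returns, and Benjamini-Hochberg is only one possible choice of correction (the post does not say which one is used).

```python
# Hypothetical toy table: OTU -> {sample: read count}. All values are invented.
counts = {
    "OTU_a": {"inf1": 3, "inf2": 0, "un1": 2100, "un2": 0},
    "OTU_b": {"inf1": 900, "inf2": 450, "un1": 800, "un2": 700},
    "OTU_c": {"inf1": 0, "inf2": 0, "un1": 40, "un2": 15},
}
infected = ["inf1", "inf2"]


def keep_low_in_infected(table, infected_ids, max_reads=100):
    """Step 1: keep OTUs whose summed reads over infected samples is < max_reads."""
    return {
        otu: per_sample
        for otu, per_sample in table.items()
        if sum(per_sample[s] for s in infected_ids) < max_reads
    }


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


candidates = keep_low_in_infected(counts, infected)

# Step 2 placeholder: one raw p-value per retained OTU (invented numbers,
# standing in for the output of the per-OTU model fit).
raw_p = dict(zip(sorted(candidates), [0.004, 0.03]))
adjusted = dict(zip(raw_p, bh_adjust(list(raw_p.values()))))
print(adjusted)
```

Note that the correction is applied over the OTUs that survive the filter, which is how the post describes it ("depending on the number of OTUs checked").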

Even if the above 2 steps do single out the desired type of OTUs, I am in doubt whether what I am doing is correct and whether I should use a zero-inflated model. I would be very happy to hear the opinion of people working with similar datasets (or not), especially because my knowledge of stats is limited.

Any advice is very very welcome,


try indicator species analyses (indicspecies package). The stats behind it are very straightforward and are designed to find exactly what you are looking for.

Thanks @kmitchell, going through the manual now, looks pretty cool!

Its biggest drawback for microbial communities is that many organisms it calls as indicators are very rare. Most organisms that I’d be interested in are likely present at low abundance in one group and at high abundance in another; that distribution wouldn’t be called an indicator by indicspecies.

If you look at line 214 in my generic data exploration R script, you can pull my code for making a spreadsheet with the indicator species, correcting the p-values, and attaching the taxonomic info for the OTUs.

Thanks again! I think it is great; the correlation function in multipatt and the negative/positive values (showing preference for or avoidance of a niche) are exactly what I want! Based on what I am interested in, I care more about presence/absence (regardless of abundance), and so far it seems to flag as significant (or close to significant) exactly the OTUs I want. Of course, this is only 1 dataset I am looking at and I will have to see how well it plays with others. :slight_smile:
Thanks also for sharing the script, much appreciated!

So, going through the paper now, the main difference from DESeq is that the method assumes a good indicator species should be restricted to the target site group; therefore, a species that is present at low abundance in one group and at high abundance in another does not fall within these parameters.
Did I get that right?
sorry for asking all the time - just wanna make sure I get the difference

Yes, that’s correct. An indicator species is both pure (found only in the one group) and has high fidelity (most samples in that group have the species).
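
To make the two ingredients concrete, here is a rough sketch in plain Python of an indicator-value style calculation: a specificity term (how concentrated the reads are in the target group) and a fidelity term (the fraction of target-group samples where the OTU is present). This is an illustration of the idea only, not the indicspecies code, and the exact statistic the package reports may differ. The read counts are invented to mimic the OTU828 pattern described earlier (present in 1 of 24 infected samples at low reads, and in 4 of 28 uninfected samples at higher reads).

```python
def indicator_components(counts, groups, target):
    """Return (specificity, fidelity, product) for one OTU and one target group.

    specificity: mean abundance in the target group divided by the sum of
                 the per-group mean abundances.
    fidelity:    fraction of target-group samples in which the OTU is present.
    """
    by_group = {}
    for count, group in zip(counts, groups):
        by_group.setdefault(group, []).append(count)

    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    total = sum(means.values())
    specificity = means[target] / total if total > 0 else 0.0

    target_counts = by_group[target]
    fidelity = sum(1 for c in target_counts if c > 0) / len(target_counts)

    return specificity, fidelity, specificity * fidelity


# Invented counts shaped like the OTU828 description (not real data).
infected = [5] + [0] * 23
uninfected = [2000, 500, 300, 50] + [0] * 24
counts = infected + uninfected
groups = ["infected"] * len(infected) + ["uninfected"] * len(uninfected)

spec, fid, product = indicator_components(counts, groups, "uninfected")
print(f"specificity={spec:.3f} fidelity={fid:.3f} product={product:.3f}")
```

With numbers like these the specificity is close to 1 (almost all reads sit in the uninfected group), but the fidelity is only 4/28, so the combined value stays low. That is the "pure but not high fidelity" case.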
