I have recently started a new project where I am trying to find mutually exclusive OTUs in 2 groups of samples.
The 2 groups of subjects are: subjects that are infected with an enterohemorrhagic pathogen (24) and subjects that are not (28). What I want to find is bacterial OTUs that are mutually exclusive with the enterohemorrhagic pathogen. Theoretically, I should be able to see that with a DESeq analysis or with random forest. However, I cannot because, like the title says, my data is buried behind 0s.
More specifically, I have 24 infected samples and 28 uninfected samples. When making a heatmap, I can see that OTU828 could be of interest: among the infected samples it is present in only one, with <10 reads, whereas among the uninfected samples it is present in 4, at higher levels that can reach >2000 reads. However, in most samples (47/52) it is absent.
YES, I understand that such a rare OTU may not seem very interesting, BUT it does make sense in the context of my biological question: “searching for OTUs that are (almost) never present in infected samples but are present in uninfected samples (even if only sometimes)”.
Examining the dataset with DESeq and random forest did not single out the OTU of interest. However, if I examine that OTU on its own with a zero-inflated binomial model, I do get a very high significance.
In order to do that in a more controlled way, I am doing the analysis in 2 steps: 1. keep only OTUs whose summed reads across the infected samples are <100 (that number can change depending on the dataset) and 2. fit a zeroinfl model (pscl package in R) to each remaining OTU, then correct the p-values for the number of OTUs tested.
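To make the two steps concrete, here is a rough Python sketch on made-up counts. It is not my actual code: the per-OTU zeroinfl fit is done in R and is not reproduced here, so a one-sided Fisher's exact test on presence/absence stands in for it. The OTU names, the toy counts, the helper function names, and the choice of Benjamini-Hochberg for the correction are all assumptions for illustration only.

```python
from math import comb

# Toy read counts, made up for illustration: OTU -> reads per sample.
infected = {
    "OTU_A": [0, 0, 3, 0],        # rare in infected samples
    "OTU_B": [500, 20, 0, 310],   # abundant in infected samples
    "OTU_C": [0, 0, 0, 0],        # never seen in infected samples
}
uninfected = {
    "OTU_A": [0, 2100, 0, 850, 40],
    "OTU_B": [10, 0, 0, 5, 0],
    "OTU_C": [0, 0, 120, 0, 0],
}
MAX_INFECTED_READS = 100  # step 1 threshold; dataset dependent


def keep_rare_in_infected(counts, max_reads):
    """Step 1: keep OTUs whose summed reads in infected samples are < max_reads."""
    return [otu for otu, reads in counts.items() if sum(reads) < max_reads]


def fisher_one_sided(a, b, c, d):
    """Stand-in for step 2 (NOT zeroinfl): probability of seeing the OTU in
    at most `a` infected samples, given the 2x2 presence/absence table
    a = infected present, b = infected absent,
    c = uninfected present, d = uninfected absent."""
    n_inf, n_uninf, n_present = a + b, c + d, a + c
    total = comb(n_inf + n_uninf, n_present)
    tail = sum(comb(n_inf, x) * comb(n_uninf, n_present - x) for x in range(a + 1))
    return tail / total


def benjamini_hochberg(pvals):
    """Adjust p-values for the number of OTUs tested (one possible correction)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


kept = keep_rare_in_infected(infected, MAX_INFECTED_READS)
pvals = []
for otu in kept:
    a = sum(r > 0 for r in infected[otu])
    c = sum(r > 0 for r in uninfected[otu])
    pvals.append(fisher_one_sided(a, len(infected[otu]) - a,
                                  c, len(uninfected[otu]) - c))

for otu, p, q in zip(kept, pvals, benjamini_hochberg(pvals)):
    print(f"{otu}: p={p:.3f}, adjusted p={q:.3f}")
```

Note that in this sketch only the OTUs that pass the filter are counted in the correction, which is exactly the part I am unsure about, since the filter itself looks at the group labels.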
Even if the above 2 steps do single out the desired type of OTUs, I am in doubt whether what I am doing is correct and whether I should use a zero-inflated model. I would be very happy to hear the opinion of people working with similar datasets (or not), especially because my knowledge of statistics is limited.
Any advice is very very welcome,