Differential representation when the OTU of interest is hidden in 0s

sapou · September 17, 2019, 1:55pm

Dear all

I have recently started a new project where I am trying to find mutually exclusive OTUs in 2 groups of samples.
The 2 groups of subjects are: subjects that are infected with an enterohemorrhagic pathogen (24) and subjects that are not (28). What I want to find is bacterial OTUs that are mutually exclusive with the enterohemorrhagic pathogen. Theoretically, I should be able to see that with a DESeq analysis or with random forest. However, I cannot because, like the title says, my data is buried behind 0s.
More specifically, I have 24 infected samples and 28 uninfected samples. When making a heatmap, I can see that OTU828 could be of interest: in infected samples it is present only once, with <10 reads, whereas in uninfected samples it is present in 4 samples at higher levels and can reach >2000 reads. However, in most samples (47/52) it is absent.
YES, I understand that if it is such a rare OTU maybe it is not very interesting, BUT in the context of my biological question, “searching for OTUs that are (almost) never present in infected samples but are present in uninfected samples (even if only sometimes)”, it does make sense.

Examining the dataset with DESeq and random forest did not single out the OTU of interest. However, if I examine the OTU with a zero-inflated binomial model, I do get a very high significance.
To do that in a more controlled way, I am doing the analysis in 2 steps: 1. keep only OTUs whose summed reads across the infected samples are <100 (that number can change depending on the dataset) and 2. fit a zeroinfl model (pscl package in R) to each OTU, and afterwards correct for the number of OTUs checked.
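The filtering and correction steps described above could be sketched roughly as follows. This is only an illustration in plain Python, not the poster's R code: the function names, the toy counts, and the p-values are hypothetical, and Benjamini-Hochberg is an assumption, since the post does not say which multiple-testing correction is used. The per-OTU zeroinfl fit itself is not reproduced here.

```python
def prefilter_otus(counts, infected, max_infected_reads=100):
    """Step 1: keep OTUs whose summed reads across infected samples
    are below the threshold (100 in the post, dataset dependent)."""
    kept = []
    for otu, reads in counts.items():
        infected_total = sum(r for r, inf in zip(reads, infected) if inf)
        if infected_total < max_infected_reads:
            kept.append(otu)
    return kept


def benjamini_hochberg(pvalues):
    """Step 2 (correction only): Benjamini-Hochberg adjusted p-values.
    BH is an assumed choice; the post names no specific method."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy data: 3 infected samples followed by 3 uninfected ones.
counts = {
    "OTU_a": [0, 5, 0, 2000, 0, 350],
    "OTU_b": [400, 900, 120, 300, 0, 80],
}
infected = [True, True, True, False, False, False]
kept = prefilter_otus(counts, infected)

# Hypothetical per-OTU p-values standing in for the model output.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The correction is applied over the OTUs that survive the filter, which matches the "depending on the number of OTUs checked" wording in the post.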

Even if the above 2 steps do single out the desired type of OTUs, I am in doubt whether what I am doing is correct and whether I should use a zero-inflated model. I would be very happy to hear the opinion of people working with similar datasets (or not), especially because my knowledge of stats is limited.

Any advice is very very welcome,

thanks,
P

Kendra · September 17, 2019, 5:11pm

Try indicator species analysis (indicspecies package). The stats behind it are very straightforward and are designed to find exactly what you are looking for.

sapou · September 18, 2019, 10:42am

Thanks @Kendra, going through the manual now, looks pretty cool!

Kendra · September 18, 2019, 2:47pm

Its biggest drawback for microbial communities is that many organisms it calls as indicators are very rare. Most organisms that I’d be interested in are likely present at low abundance in one group and at high abundance in another; that distribution wouldn’t be called an indicator by indicspecies.

If you look at line 214 in my generic data exploration R script, you can pull my code for making a spreadsheet with the indicator species, correcting the p-values, and attaching the taxonomic info for the OTUs. https://github.com/krmaas/bioinformatics/blob/master/Generic.data.exploration.Rmd

sapou · September 19, 2019, 6:12am

Thanks again! I think it is great. The correlation function in multipatt, with its negative/positive values (showing preference for or avoidance of a niche), is exactly what I want! For my question I care more about presence/absence (regardless of abundance), and so far it reports as significant (or close to significance) exactly the OTUs I want. Of course this is only 1 dataset I am looking at and I will have to see how well it plays with others.
Thanks also for sharing the script, much appreciated!
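To make the sign convention concrete, here is a minimal sketch of a plain phi coefficient computed on presence/absence counts. It is a simplification and an assumption on my part: the `r.g` option in multipatt uses a group-equalized version, which this does not reproduce, and the function name is hypothetical. The counts are the OTU828 numbers from the first post (present in 4 of 28 uninfected and 1 of 24 infected samples).

```python
import math


def phi_coefficient(present_target, n_target, present_other, n_other):
    """Plain phi coefficient for a 2x2 presence/absence table.
    Positive: the OTU prefers the target group; negative: it avoids it."""
    a = present_target              # present in target group
    b = n_target - present_target   # absent in target group
    c = present_other               # present in the other group
    d = n_other - present_other     # absent in the other group
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return 0.0  # OTU present everywhere or nowhere: no association
    return (a * d - b * c) / denom


# OTU828 with "uninfected" as the target group, then with "infected".
phi_uninfected = phi_coefficient(4, 28, 1, 24)
phi_infected = phi_coefficient(1, 24, 4, 28)
```

With two groups the two values are equal in size and opposite in sign, which is the preference/avoidance reading mentioned in the post.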

sapou · September 19, 2019, 4:32pm

So, going through the paper now (https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/j.2041-210X.2012.00246.x), the main difference from DESeq is that the method assumes a good indicator species should be restricted to the target site group. Therefore, a species that is present at low abundance in one group and at high abundance in another does not fall within these parameters.
Did I get that right?
Sorry for asking all the time, I just want to make sure I get the difference.

Kendra · September 19, 2019, 8:08pm

Yes, that’s correct. An indicator species is both pure (found only in the one group) and has high fidelity (most samples in that group have the species).
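The two properties can be put into numbers with a small presence/absence sketch. It assumes the group-equalized form, where specificity A is the group's presence frequency divided by the sum of presence frequencies over all groups, fidelity B is the presence frequency within the group, and the reported statistic is sqrt(A * B). Treat the formula details and the function name as my assumptions rather than a copy of the indicspecies internals. The counts are again OTU828 from the first post.

```python
import math


def indval_presence_absence(present_by_group, size_by_group, target):
    """Return (A, B, statistic) for one OTU and one target group.
    A = specificity ("pure"), B = fidelity, statistic = sqrt(A * B)."""
    freqs = {g: present_by_group[g] / size_by_group[g] for g in size_by_group}
    total = sum(freqs.values())
    fidelity = freqs[target]
    specificity = fidelity / total if total > 0 else 0.0
    return specificity, fidelity, math.sqrt(specificity * fidelity)


present = {"uninfected": 4, "infected": 1}
sizes = {"uninfected": 28, "infected": 24}
A, B, stat = indval_presence_absence(present, sizes, "uninfected")
```

For OTU828 the specificity is fairly high (about 0.77) but the fidelity is low (4/28, about 0.14), which is why such a rare OTU scores modestly even though it is nearly exclusive to one group.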

system · September 29, 2019, 8:08pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.
