Minimal R relative abundance question

Hi all, I am following the minimal R tutorial to analyze the relative abundance of my sediment microbiome samples. No matter how many times I do this, I am confused about why the y-axis in the tutorial goes to 100 when:

  1. It's just the top phyla, so how would it even equal 1 or 100? For example, my top phyla (I did the top 6) only reach 64.5%, so the rest would be rare phyla. Why does it equal 100 in the tutorial?
  2. In the tutorial the abundance isn't multiplied by 100 at any point, so why is it 100 and not 1 in the graphs anyway?

When I check whether the relative abundances in each of my samples sum to 1, they do, even when I use my subsampled cons.taxonomy file. BUT when I use the top phyla only, they obviously don't.

I feel like I am missing some fundamental concept for understanding relative abundance graphs.

Sorry if this is the wrong topic to put this under, but I am at a loss.

Hey there - I can’t tell you how happy I get when I see people using these resources. Thanks for your question!

Is this the figure you’re asking about? I’m grabbing this from Session 9. It’s the figure shown right before Activity 5 and the “Hypothesis testing” section…

1. It's just the top phyla, so how would it even equal 1 or 100? For example, my top phyla (I did the top 6) only reach 64.5%, so the rest would be rare phyla. Why does it equal 100 in the tutorial?

Keep in mind that these are human fecal samples and the y-axis is on a log scale. Human fecal samples are dominated by two phyla: Firmicutes and Bacteroidetes. In this case the median for Firmicutes is at about 80% and the median for Bacteroidetes is at about 20%. The missing portion is largely made up of small amounts of Actinobacteria, Proteobacteria, and Verrucomicrobia. The median for these three phyla is right around 1-2%. I would totally expect data from other environments like sediments to be more diverse at the phylum level, leading to more phyla, lower median relative abundances, and greater IQRs than we see here.
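To make the arithmetic concrete, here is a small base R sketch. The counts and the phylum names (A through E) are invented for illustration and do not come from the tutorial or from your data. It shows that each sample sums to 1 across all phyla, while the same sample sums to less than 1 once you keep only the top few phyla.

```r
# Toy counts: rows are samples, columns are phyla (all values invented)
counts <- matrix(
  c(50, 20, 15, 10,  5,
    40, 30, 10, 10, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), c("A", "B", "C", "D", "E"))
)

# Relative abundance: divide each count by its sample (row) total
rel_abund <- counts / rowSums(counts)

# Across all phyla, every sample sums to 1
all_phyla_total <- rowSums(rel_abund)

# Keep the top 3 phyla, ranked by median relative abundance across samples
phylum_medians <- apply(rel_abund, 2, median)
top_phyla <- names(sort(phylum_medians, decreasing = TRUE))[1:3]

# Within the top phyla only, every sample sums to less than 1
top_total <- rowSums(rel_abund[, top_phyla])

print(all_phyla_total)
print(top_total)
```

The gap between `top_total` and 1 is simply the share taken up by the phyla that were left out of the plot.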

2. In the tutorial the abundance isn't multiplied by 100 at any point, so why is it 100 and not 1 in the graphs anyway?

If you look at the next-to-last line of the code block immediately above this figure, you’ll see this:

scale_y_log10(breaks=c(1e-4, 1e-3, 1e-2, 1e-1, 1), labels=c(1e-2, 1e-1, 1, 10, 100)) +

In this case, I changed the labels for these breaks in scale_y_log10(), so the data are still proportions and only the axis text reads as percentages. You could certainly multiply agg_rel_abund by 100 and leave out the labels option. You could do that in the ggplot() line by setting y = agg_rel_abund * 100. To get the same y-axis breaks you would replace my breaks values with what I had for labels.
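The two options could be sketched like this. The data frame `toy` and its values are invented for illustration; only the `agg_rel_abund` column name and the `scale_y_log10()` breaks and labels come from the tutorial code. Both plots should end up with the same axis text.

```r
library(ggplot2)

# Invented data: relative abundances stored as proportions (0 to 1)
toy <- data.frame(
  phylum = rep(c("A", "B"), each = 4),
  agg_rel_abund = c(0.60, 0.70, 0.80, 0.75, 0.01, 0.02, 0.05, 0.03)
)

breaks_prop <- c(1e-4, 1e-3, 1e-2, 1e-1, 1)  # breaks on the proportion scale
breaks_pct  <- c(1e-2, 1e-1, 1, 10, 100)     # the same positions as percentages

# Option 1: plot proportions and relabel the breaks as percentages
p1 <- ggplot(toy, aes(x = phylum, y = agg_rel_abund)) +
  geom_boxplot() +
  scale_y_log10(breaks = breaks_prop, labels = breaks_pct)

# Option 2: convert to percentages first, so no labels argument is needed
p2 <- ggplot(toy, aes(x = phylum, y = agg_rel_abund * 100)) +
  geom_boxplot() +
  scale_y_log10(breaks = breaks_pct)
```

Either way, the underlying numbers are unchanged; the choice is only about where the conversion to percent happens.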

Let me know if I’ve got the wrong figure or if you have further questions. Thanks again for giving them a try!

Pat

Thank you so much for your reply. So it's fine that my top 6 (my y-axis is also on a log scale) total up to 64.5%? Whenever I see relative abundance graphs they all add up to 100%, even when it's only the most abundant microbes, so it made me pretty nervous to see this difference. I also noticed that you used the subsampled data (I did too). Does subsampled data give us a better idea of the relative abundances because all samples have the same number of sequences? I also attached my graph in case you wanted to see the fruits of my labour haha

You can always plot more phyla, but are you sure they only add up to 64%? It’s hard to interpret a log scale, but for the Fall/Winter group the first phylum seems to be around 60%, then 15%, then the next four are around 5%.

Can you post the output of running this?

# Median relative abundance of each phylum across samples, largest first
agg_phylum_data %>%
  group_by(phylum) %>%
  summarize(median = median(agg_rel_abund)) %>%
  arrange(desc(median))

Pat

Hi! Thank you for your reply. Here is my output from that code; the rest of my code is the same as the tutorial.

Just doing the math in my head, those get you close to/over 90%. I think that’s pretty good for a sediment ecosystem.

Could you try it with the mean instead of median?

This is what I get when I use the mean instead. I'm unsure when to use the mean versus the median. What different story does each tell about my microbes?

I would use the median since your data are not normally distributed.
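As a rough illustration of the difference, here is a base R sketch with invented numbers (none of these values come from the tutorial or from your data). The first part shows how a single high sample pulls the mean up while the median stays put. The second part shows that per-phylum means add up to 1 when every phylum is included, whereas per-phylum medians do not have to.

```r
# Invented relative abundances of one phylum across five samples
x <- c(0.01, 0.02, 0.02, 0.03, 0.60)
mean_x   <- mean(x)    # pulled upward by the one high sample
median_x <- median(x)  # the middle value, unaffected by the high sample

# Invented table: rows are samples, columns are phyla, each row sums to 1
rel_abund <- matrix(
  c(0.7, 0.2, 0.1,
    0.5, 0.4, 0.1,
    0.2, 0.2, 0.6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("A", "B", "C"))
)

phylum_means   <- colMeans(rel_abund)
phylum_medians <- apply(rel_abund, 2, median)

print(c(mean = mean_x, median = median_x))
print(sum(phylum_means))    # means across all phyla add up to 1
print(sum(phylum_medians))  # medians need not add up to 1
```

So a set of medians that totals less than 100% is not a sign that anything went wrong in the calculation.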

Pat
