Compare Samples With Sequences of Different Lengths?

Hello, this post is mainly a question about how to compare samples from different experiments/16S regions.

I’m trying to conduct a meta-analysis of different studies looking at the same biological process but with different designs. The experiments use 3 different regions: V3-V4 (n=2), V4-V5 (n=1), and V4 (n=1). My approach so far has been to follow the MiSeq SOP for each dataset, and then try to merge the sequences/taxonomy abundances post-pipeline. I’m using PICRUSt2 to do functional predictions, and plan on binning the functional abundances based on the respective OTUs’ genus-level classifications, an approach that I think is imperfect but “good enough”. However, I would also like to compare the community structure of samples from different experiments. Does anyone have an idea of how I could do this? Feel free to critique my methods as well; this is my first time making a pipeline like this.

vr,
:saluting_face:
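
For readers following along, the genus-level binning described in the question could be sketched roughly as below. This is only a minimal illustration: the nested-dictionary layout, the `bin_by_genus` helper, and the OTU, genus, and function labels are all hypothetical, and it assumes per-OTU functional abundances and genus classifications have already been extracted from each dataset's own pipeline output.

```python
from collections import defaultdict


def bin_by_genus(otu_function_abundance, otu_genus):
    """Sum per-OTU functional abundances into per-genus totals.

    otu_function_abundance: {otu_id: {function_label: abundance}}
    otu_genus:              {otu_id: genus_label}
    OTUs without a genus label are collected under "unclassified".
    """
    genus_totals = defaultdict(lambda: defaultdict(float))
    for otu, functions in otu_function_abundance.items():
        genus = otu_genus.get(otu, "unclassified")
        for function_label, abundance in functions.items():
            genus_totals[genus][function_label] += abundance
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {genus: dict(funcs) for genus, funcs in genus_totals.items()}


# Hypothetical toy data for one dataset (not real results).
study_a = {
    "Otu001": {"cellulase": 4.0, "pectinase": 1.0},
    "Otu002": {"cellulase": 2.0},
    "Otu003": {"pectinase": 3.0},
}
taxonomy_a = {"Otu001": "Bacteroides", "Otu002": "Bacteroides"}

binned = bin_by_genus(study_a, taxonomy_a)
```

Because OTU identifiers from different regions are not comparable, the genus label is the only key shared across datasets here, which is why the binning is done per dataset before any comparison.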

Hi there - I would analyze each of those regions separately and then pool the results. Generally I discourage this type of analysis since it splits your N unnecessarily. Keep in mind that each primer set and region has its own biases, strengths, and weaknesses. If the n=# is the number of samples you have, I don’t know what type of analysis you’ll really be able to perform since those numbers - even if pooled - are quite small. Here is an example of how we have done this in the past…

https://journals.asm.org/doi/10.1128/mbio.01018-16
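
One way the "analyze separately, then pool" idea might look in practice is a fixed-effect, inverse-variance pooling of a per-dataset effect size. This is a generic meta-analysis technique swapped in for illustration, not necessarily the method used in the linked paper. The `pool_fixed_effect` helper and the effect and variance values are hypothetical, and with only four datasets the pooled estimate would be quite uncertain.

```python
import math


def pool_fixed_effect(effects, variances):
    """Inverse-variance weighted mean of per-dataset effect sizes.

    effects:   one effect estimate per dataset (e.g. a difference in a
               diversity metric between two conditions), computed within
               each dataset on its own
    variances: the sampling variance of each estimate
    Returns (pooled_effect, standard_error).
    """
    if len(effects) != len(variances):
        raise ValueError("effects and variances must have the same length")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, standard_error


# Hypothetical effects from four datasets (two V3-V4, one V4-V5, one V4).
effects = [0.5, 0.3, 0.4, 0.4]
variances = [0.04, 0.04, 0.04, 0.04]

pooled, se = pool_fixed_effect(effects, variances)
```

The point of the design is that sequences from different regions are never compared directly; only summary statistics computed inside each dataset are combined.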

As another critique… I wouldn’t bother with PICRUSt. The genes and functions you’ll get back are largely generic. Also, metagenomic analysis for this type of work is somewhat suspect. If one taxon increases or decreases in abundance, then all of its 4000 genes will also increase or decrease. Which one is the most important? Pretty impossible to say IMHO.

Pat
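
The concern about every gene moving with its taxon can be seen in a toy model where a predicted gene abundance is simply taxon abundance multiplied by gene copy number, summed over taxa. This is a deliberately simplified sketch and not PICRUSt2's actual implementation; the `predicted_gene_abundance` helper, the taxon and gene names, and all numbers are hypothetical.

```python
def predicted_gene_abundance(taxon_abundance, gene_copies):
    """Toy prediction: sum over taxa of (taxon abundance * gene copy number)."""
    totals = {}
    for taxon, abundance in taxon_abundance.items():
        for gene, copies in gene_copies.get(taxon, {}).items():
            totals[gene] = totals.get(gene, 0) + abundance * copies
    return totals


# Hypothetical copy numbers per taxon (real genomes carry thousands of genes).
gene_copies = {
    "taxon_A": {"gene_1": 1, "gene_2": 2, "gene_3": 1},
    "taxon_B": {"gene_1": 1},
}

# Only taxon_A changes between the two samples.
before = predicted_gene_abundance({"taxon_A": 10, "taxon_B": 5}, gene_copies)
after = predicted_gene_abundance({"taxon_A": 20, "taxon_B": 5}, gene_copies)

# Every gene carried by taxon_A shifts in the same direction at once.
shifted = {gene: after[gene] - before[gene] for gene in gene_copies["taxon_A"]}
```

In this model a single change in taxon_A raises all three of its genes together, so the prediction alone cannot say which gene, if any, matters for the process.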

Thanks for the feedback, @pschloss. That’s pretty much what I expected, and I agree with your remark about PICRUSt. I will definitely emphasize the limitations of this approach in the report I write. What I have found with the available data certainly makes your blog post still relevant. I do not want that “grimy feeling that makes you want to take a bath”. May I use your response as a reference?
All the best, thanks again.

edit: I’m only really focused on ~20 genes from the CAZy family, so it’s not like I’m looking for a needle in a haystack. The key for me is how the communities’ functions (e.g. cellulase/pectinase “abundances”) are correlated with specific points in the process. Does this make the implementation of PICRUSt2 slightly more relevant?
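
As a rough illustration of the correlation described in the edit, a Spearman rank correlation between one predicted function and the process time points could be computed within a single dataset as below. The helper functions, the time points, and the abundance values are all hypothetical, and a rank correlation is just one reasonable choice assumed here for the sketch.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical time points and predicted cellulase "abundances" from ONE dataset.
time_points = [0, 1, 2, 3, 4]
cellulase = [2.0, 3.5, 3.9, 8.0, 12.5]

rho = spearman(time_points, cellulase)
```

Keeping the calculation inside one dataset follows the advice above to analyze each region separately; the resulting coefficients could then be compared across datasets.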

No problem - feel free to use me as a reference :slight_smile:
