I just want to make sure that this workflow for using the MiSeq SOP with new data makes sense. Please let me know if I am missing something.
My sediment microbiome study has seasonal samples that are sequenced by a third-party vendor, and the sequences are then sent to us. I have all my fall/winter sequences (Batch 1) and my spring/summer sequences (Batch 2).
My plan is to run the .gz files from both batches through mothur together. I assume this is going to take forever, though.
The other option, because I've already done the MiSeq SOP on Batch 1, would be to just run Batch 2 and then combine the two in R afterward. But I assume this will be weird because of the OTU labels. I am nervous there will be a different “OTU 0001” in each batch and that combining them will make the outcome weird.
I am unsure if this understanding is correct or if I am missing a fundamental point of mothur.
Please let me know!
Hi
OTU assignments depend on the data that are provided. So if you change the data that are provided, you will get different OTUs. Even if OTU0001 is the same, OTU0123 will most likely be different.
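To make that concrete, here is a toy Python sketch. It is not mothur's clustering; it only assumes that OTU labels are handed out by abundance rank within whatever data were clustered. The cluster names and read counts are made up for illustration.

```python
from collections import Counter

def label_by_abundance(counts):
    """Assign OTU-style labels by descending abundance rank.

    Ties are broken alphabetically by cluster name so the result is deterministic.
    """
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: f"OTU{rank:04d}" for rank, name in enumerate(ranked, start=1)}

# Hypothetical read counts per cluster (invented numbers, not real data).
batch1 = Counter({"clusterA": 500, "clusterB": 300, "clusterC": 50})
batch2 = Counter({"clusterA": 450, "clusterC": 400, "clusterD": 120})

labels_batch1 = label_by_abundance(batch1)
labels_batch2 = label_by_abundance(batch2)
labels_pooled = label_by_abundance(batch1 + batch2)  # Counter addition sums the counts

# Print each cluster's label in batch 1, batch 2, and the pooled run.
for name in sorted(set(batch1) | set(batch2)):
    print(name,
          labels_batch1.get(name, "-"),
          labels_batch2.get(name, "-"),
          labels_pooled[name])
```

In this toy case OTU0001 happens to be the same cluster in every run, but OTU0002 is a different cluster in batch 1 than in batch 2, so joining separately processed tables on the label would merge unrelated organisms. With real data the membership of each OTU can shift as well, not just its label.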
Ideally, you would run batch 1 and draw your inferences, then run batch 2 and draw your inferences, and then compare your inferences between batches. This would effectively be a meta-analysis like the ones I published with Marc Sze. This approach would work if there are variables other than season that you want to compare.
If you want to compare fall/winter to spring/summer, this won’t work. If this was the primary research question, then all of the samples should have been randomized, extracted, and sequenced together. By pooling everything now in silico, you have confounded season with sequencing run, and there’s no way to correct for that when the confounding is perfect. I grant that there are likely seasonal differences, but there are also small technical differences that accumulate between the batches, and you can't tell the two apart. This was a big problem with the Human Microbiome Project (HMP), where the variable with the most explanatory power was the city where people were from (Houston or St Louis). Turns out that was perfectly confounded with sequencing center.
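A quick way to see the problem is to tabulate season against sequencing run. The Python sketch below uses made-up sample sheets, and the function and field names are hypothetical. It flags a design as perfectly confounded when every season group maps to exactly one run and every run maps to exactly one season group.

```python
from collections import defaultdict

def levels_seen_with(samples, key, other):
    """Map each level of `key` to the set of `other` levels it co-occurs with."""
    seen = defaultdict(set)
    for sample in samples:
        seen[sample[key]].add(sample[other])
    return seen

def is_perfectly_confounded(samples, factor, nuisance):
    """True when factor and nuisance levels pair off one-to-one (more than one level)."""
    factor_to_nuisance = levels_seen_with(samples, factor, nuisance)
    nuisance_to_factor = levels_seen_with(samples, nuisance, factor)
    one_to_one = (all(len(v) == 1 for v in factor_to_nuisance.values())
                  and all(len(v) == 1 for v in nuisance_to_factor.values()))
    return one_to_one and len(factor_to_nuisance) > 1

# Hypothetical design resembling the one described: each season group has its own run.
as_sequenced = [
    {"sample": "S1", "season": "fall/winter", "run": "batch1"},
    {"sample": "S2", "season": "fall/winter", "run": "batch1"},
    {"sample": "S3", "season": "spring/summer", "run": "batch2"},
    {"sample": "S4", "season": "spring/summer", "run": "batch2"},
]

# Hypothetical randomized design: both season groups appear in both runs.
randomized = [
    {"sample": "S1", "season": "fall/winter", "run": "batch1"},
    {"sample": "S2", "season": "fall/winter", "run": "batch2"},
    {"sample": "S3", "season": "spring/summer", "run": "batch1"},
    {"sample": "S4", "season": "spring/summer", "run": "batch2"},
]

print(is_perfectly_confounded(as_sequenced, "season", "run"))
print(is_perfectly_confounded(randomized, "season", "run"))
```

In the first design, any difference between season groups is equally well explained by the run, so no model can separate the two. In the second, each run contains both season groups, which is what allows a run effect to be estimated separately from season.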
Pat
These things weren’t considered during the development of my project. As someone who is trying to analyze these data to observe the seasonal differences between microbial communities, do you have any advice for doing this in a publishable way?