When to combine data from multiple runs, ongoing project

I have a project that gets new sequences every few weeks. When I've combined multiple runs before, I merged the fastas and groups files after make.contigs. But with this project, data will continue trickling in for the next year or more, and people want to see how the new samples compare as they come in. So I was thinking of running each set through the MiSeq SOP up through chimera checking, degapping that fasta, and keeping the degapped fasta and count table as the "finished" product for the run. When I'm ready to analyze again, I'd merge the fastas and count tables (can I just cat count tables?), align and filter again, classify seqs… Does that make sense? Is there another point in the process where it'd make more sense to stop?
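To be concrete, at merge time I was picturing something like this (file names are placeholders for each run's output; I gather merge.files and merge.count are the right tools here, since cat-ing count tables would mangle their headers and group columns):

# merge the per-run degapped fastas and count tables (run1/run2 names are placeholders)
mothur > merge.files(input=run1.ng.fasta-run2.ng.fasta, output=all_runs.fasta)
mothur > merge.count(count=run1.count_table-run2.count_table, output=all_runs.count_table)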

Technically… If all you were after was classification, I would do as you suggest. If you want OTUs, then you probably need to go back to pooling the output from make.contigs.
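For reference, pooling at that stage just means merging each run's make.contigs output before any downstream processing, along these lines (file names are hypothetical):

# combine the per-run contigs and their groups files (placeholder names)
mothur > merge.files(input=run1.trim.contigs.fasta-run2.trim.contigs.fasta, output=all_runs.trim.contigs.fasta)
mothur > merge.files(input=run1.contigs.groups-run2.contigs.groups, output=all_runs.contigs.groups)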

Scientifically… I would really encourage your collaborators to cool their jets and just let the data come in. There are real concerns about batch effects being introduced by doing the analysis the way you describe. Samples processed today may be processed slightly differently in the future, and so you may see a temporal pattern that is really a laboratory artifact. A countermeasure would be to process samples over multiple weeks so that the sets of samples aren't totally independent. But like I said, I'd wait until the samples are in and then process everything together. Batch effects have been a huge problem in GWAS and microarray data, and I'm pretty sure they're a big problem in microbiome data too. For example, this came up in the HMP analysis, where half the samples were processed in St. Louis and half in Houston. Of course, the processing site and the sampling population were perfectly confounded. The variable with the largest effect in the study? The city where the samples were processed.

Hi Pat, I'm after OTUs, but I thought that going through chimera checking (which is before clustering) would still get me there? I'm confused about why you're saying it wouldn't. I want to avoid chimera checking the same sequences over and over, since that is a computationally expensive step.

Ah, I see that in my question the "align and filter again, classify seqs…" was unclear: the ellipsis was meant to indicate all of the downstream commands (cluster.split, make.shared, classify.otu, through the alpha and beta diversity calcs).
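In other words, once the merged data have been realigned, filtered, and classified, I'd pick up with something like this (file names are simplified placeholders and parameters are the SOP defaults):

# cluster into OTUs at 0.03, then build the shared file and the OTU consensus taxonomy
mothur > cluster.split(fasta=all_runs.fasta, count=all_runs.count_table, taxonomy=all_runs.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)
mothur > make.shared(list=all_runs.list, count=all_runs.count_table, label=0.03)
mothur > classify.otu(list=all_runs.list, count=all_runs.count_table, taxonomy=all_runs.taxonomy, label=0.03)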

I certainly hear you on the batch effect, but this is a large enough sample set that it can't all be run together, so there's no getting around batches. In previous projects I randomized replicate samples across plates to minimize the plate effect, which was mostly possible because a huge set of samples came in all at once. With this project that isn't likely, given how quickly they want data. The subject recruitment should be fairly random, so at least it shouldn't happen that all the subjects with a certain condition are run at once. But all the samples from an individual are usually on the same run, which isn't ideal.

If you're just going to use classification, then yeah, you could pool everything after chimera checking. The problem with doing OTUs is that, because of the filtering step, your alignments won't be compatible between batches: filter.seqs removes columns based on the sequences that happen to be in that batch, so each batch ends up with different columns removed and different alignment coordinates. Classification is alignment-independent, but OTUs depend on the alignment.

Yes, I would have to realign every time.
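So at each analysis point the merged, degapped fasta would go back through alignment and filtering before clustering; roughly like this, where the reference is whatever customized SILVA alignment the runs were originally aligned against (file names are placeholders):

# realign the merged, degapped sequences against the same reference used for each run
mothur > align.seqs(fasta=all_runs.fasta, reference=silva.v4.fasta)
# refilter so every batch ends up in one common set of alignment coordinates
mothur > filter.seqs(fasta=all_runs.align, vertical=T, trump=.)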