Merge data

Hello Mothur,

I have data from two different sequencing chemistries, 2x300 and 2x250, for the same set of samples, and I would like to merge them to increase the number of reads for a few samples.
So far I have processed both runs separately up to chimera removal, but how can we merge them?

Thanks in advance!!
Best,

I probably would have merged them all way back at the make.contigs step. That’s probably the easiest thing to do.

Pat

Thank you, Pat.
But does this mean that the different read lengths at the contig-making step won’t cause problems during screening?

Right - sorry - probably best to merge everything right after make.contigs.

pat

Hi everyone,
I have a related problem regarding the merging of datasets. We have samples from two years that were sequenced together (MiSeq). We tried to run everything through the MiSeq SOP, but kept running into memory problems (we are a small school and have a small cluster)… After many months of mucking about and some memory increases, we still couldn’t get it all of the way through. After splitting the data by year we had no problems getting each through the SOP, and I have now analyzed them (OTUs) independently (and they are each quite interesting). Now I’d like to compare years. After reading some of the related posts on merging data, there is always the suggestion to go back to the beginning and run them together… which puts us back at our original dilemma. Is it not OK/possible to merge the data after the data cleaning? I recognize that doing it this way limits the power of chimera checking, but I don’t quite understand why it would still not be kosher. I am also a bit fuzzy on how, or whether, you would need to uncluster (and deunique) before merging and then recluster (and reunique) sequences after merging, but before OTU determination. Thank you for helping us to resolve this issue.
best,
Chris

Hi Chris,

The latest you would want to merge would be right before running dist.seqs/cluster or cluster.split. The way we run chimera.uchime in the SOP detects chimeras by sample, so it doesn’t depend on the other samples. Have you tried running cluster.split?

Pat

Hi Pat,

That is a relief. Yes, we ran cluster.split (with the settings as described in the SOP). Which files need to be merged? Would it be sufficient to use merge.files to concatenate the FASTA and taxonomy files from each run, and then merge the count_table (by hand I assume)? Is there more that would need to be done prior to re-running the cluster.split on the merged data? Thank you so much for your help!

best,

Chris

Actually, now that I think about it, you would need to make sure that the alignments are the same after you do the filtering, so it might be easier to merge everything before align.seqs. You would want to merge the fasta, names, and groups files, or the fasta and count files.
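
For anyone who wants to see what merging two count files "by hand" amounts to, here is a minimal Python sketch. It assumes the plain tab-delimited count table layout (a `Representative_Sequence` column, a `total` column, then one column per group) and that both tables fit in memory. The function names `read_count_table` and `merge_count_tables` are made up for this illustration and are not part of mothur. It only pads each run's rows with zeros for the other run's groups; it does not re-unique sequences that are identical across runs.

```python
import csv
import io


def read_count_table(text):
    """Parse a tab-delimited count table: name, total, then one column per group."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    groups = header[2:]  # skip the sequence-name and total columns
    counts = {row[0]: dict(zip(groups, map(int, row[2:]))) for row in body}
    return groups, counts


def merge_count_tables(text_a, text_b):
    """Combine two count tables, filling groups absent from a run with 0."""
    groups_a, counts_a = read_count_table(text_a)
    groups_b, counts_b = read_count_table(text_b)
    # Union of group names, keeping the order in which they first appear.
    groups = groups_a + [g for g in groups_b if g not in groups_a]

    merged = {}
    for counts in (counts_a, counts_b):
        for name, per_group in counts.items():
            row = merged.setdefault(name, dict.fromkeys(groups, 0))
            for group, n in per_group.items():
                row[group] += n  # sums if the same name occurs in both runs

    lines = ["\t".join(["Representative_Sequence", "total"] + groups)]
    for name, row in merged.items():
        values = [row[g] for g in groups]
        # Recompute the total from the per-group counts.
        lines.append("\t".join([name, str(sum(values))] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"
```

The merged text can then be written to a new count file to go with the concatenated fasta file.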

Pat

Hi Pat,

Since the workflow we used was the same, including aligning to the same region (having used pcr.seqs) and screening the sequences for the same start and end positions, shouldn’t the alignments be the same? I guess I am wondering how we could improve the alignment (or be more sure of it) by rerunning it with the combined data. Also, merging the data at this point won’t solve our computational problem of not making it through uchime with the full dataset. Our goal is to merge post-uchime. Doing the merger before cluster.split seems to make sense to me (or doing it prior to classify.seqs). If we do try to merge prior to cluster.split, would it just be the FASTA and taxonomy files (by concatenating them), as well as the count table (merged by hand)? Would we also need to merge the names and groups files, or would that be redundant with the merged count table?

Thanks for being so responsive and helping us push this through.
best,
Chris

The problem comes in the filtering. When you run filter.seqs on different datasets, different columns might be removed from each one. So if you took your align files and concatenated them, you should be OK to then run the combined file through filter.seqs.
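
To illustrate the point, here is a toy Python sketch of removing all-gap columns. It is a simplified model, not mothur's implementation: the helper names and the six-column alignments are invented for the example. Filtering each run on its own drops different columns, so the results no longer line up, while filtering the concatenated alignment keeps every sequence in the same coordinates.

```python
def all_gap_columns(aligned_seqs, gap_chars="-."):
    """Return the indices of columns that contain only gap characters."""
    length = len(aligned_seqs[0])
    return {i for i in range(length) if all(s[i] in gap_chars for s in aligned_seqs)}


def drop_columns(aligned_seqs, columns):
    """Remove the given column indices from every aligned sequence."""
    return ["".join(c for i, c in enumerate(s) if i not in columns) for s in aligned_seqs]


# Two made-up runs aligned against the same reference coordinates.
run_a = ["AC-G-T", "AC-G-T"]
run_b = ["AC-GAT", "A--GAT"]

# Filtering each run separately: run A loses two columns, run B only one,
# so the filtered alignments end up with different lengths.
filtered_a = drop_columns(run_a, all_gap_columns(run_a))
filtered_b = drop_columns(run_b, all_gap_columns(run_b))

# Concatenating first and filtering once: only columns that are gaps in
# every sequence of both runs are removed, so all sequences stay aligned.
combined = run_a + run_b
filtered = drop_columns(combined, all_gap_columns(combined))
```

Here `filtered_a` has 4 columns and `filtered_b` has 5, whereas every sequence in `filtered` has 5 columns.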

Pat