problems combining data from different runs

I’ve been trying to analyze MiSeq data that originated from more than one run, without success. Initially I copied the fastq files from the second run into the first run’s folder and added their file names to my stability.files. After assembling the contigs I got over 2,800,000 reads, but with a high number of ambiguous bases. In fact, after screen.seqs I lose about 1 million reads, most of which belong to samples from the second run. Interestingly, that doesn’t happen when I run the same samples in parallel without pooling them. I have also tried to combine the fasta and groups files using merge.groups, without success.
So it seems to me that the problem occurs while I am copying my files. Any advice or previous experience with this issue?
Thanks for your help.
PS: I have to download and extract the original data on a PC, as it won’t work on the Mac. Then I copied the fastq.gz files over and extracted them on a Mac.

Could you try running make.contigs on the runs in separate folders and then pool the outputs?
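
For anyone following along, pooling the per-run outputs is essentially plain concatenation of the matching files (fasta with fasta, groups with groups). Here is a minimal Python sketch of that idea. The function name `pool_files` and the file names in the usage comment are made up for illustration, and it assumes that read names and sample names do not collide between runs. mothur's own merge.files command is intended for this kind of concatenation and is probably the more natural choice inside a mothur workflow.

```python
from pathlib import Path


def pool_files(inputs, output):
    """Concatenate text files in the given order into `output`.

    A newline is added after any input that does not end with one, so the
    last record of one run never runs into the first record of the next.
    """
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")
    return output


# Hypothetical usage (file names are assumptions, not taken from the thread):
# pool_files(["run1/run1.contigs.fasta", "run2/run2.contigs.fasta"], "pooled.fasta")
# pool_files(["run1/run1.contigs.groups", "run2/run2.contigs.groups"], "pooled.groups")
```

The fasta and groups files need to be pooled in the same run order so that the two pooled files stay consistent with each other.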

Hi Pat,
It seems that this worked well, but I am having other issues. I was still losing around a million reads after using screen.seqs on the aligned sequences. So I performed the analysis on the two runs individually and saw that they align differently to the silva database:
97.5% of sequences from run 1 start at position 3062 and end at 13425, and 97.5% of sequences from run 2 start at 3082 and end at 13400. When I ran them together I got 97.5% starting at 3082 and ending at 13425, so most of the groups from run 2 were being eliminated. When I set start=3082 and end=13400 I was able to keep those reads.
For my final analysis I will have samples from several different runs. Will I have to run a separate analysis for each run in order to determine the best parameters to use?
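
The arithmetic behind start=3082 and end=13400 can be written down explicitly. screen.seqs drops sequences that start after `start` or end before `end`, so a window that keeps the bulk of every run has to use the latest typical start and the earliest typical end across runs. The Python sketch below only illustrates that calculation with the positions quoted above. `shared_window` is a made-up helper, not part of mothur, and it does not explain why the runs align differently in the first place.

```python
def shared_window(windows):
    """Return the (start, end) window that every run's typical read spans.

    `windows` is a list of (start, end) alignment positions, one per run.
    The shared start is the latest start and the shared end is the earliest end.
    """
    start = max(s for s, _ in windows)
    end = min(e for _, e in windows)
    if start >= end:
        raise ValueError("the runs share no common alignment window")
    return start, end


# Positions reported above for the two runs.
runs = {"run1": (3062, 13425), "run2": (3082, 13400)}
start, end = shared_window(list(runs.values()))
print(f"start={start}, end={end}")  # prints: start=3082, end=13400
```

With more runs, the same function takes one (start, end) pair per run, which is still one summary per run but not a separate downstream analysis for each.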

Hmmm. I guess I would want to know why you have different starting points. Our experience has been that things are pretty consistent across runs. Once you get a set of parameters that work for a primer set you should be good to go.

Hi Guys

We’ve had the same problems and have been told to use QIIME, as it is less stringent than Mothur and can cope with different runs.



So… you want to use a less stringent strategy for processing data? I’d think that as scientists we would want the best possible data :)