Combining HiSeq 2000 and 2500

Hi Mothur team,

I have amplicon data generated by the 2000 and 2500 (same V region in both cases) that I’d like to analyze together. Given the different read lengths they produce, should sequences be trimmed after alignment or after make.contigs?

Many thanks.

Hi and thanks for your question. I’m not a big fan of using the HiSeq to sequence 16S rRNA genes since it is not currently possible to generate fully overlapping reads for our popular regions (e.g. V4). I think you’d have to process the first and second reads separately. Either way, if you have reads of two different lengths from the same region, you need to trim them to the same length. You’d also want to make sure (using a mock community) that the error rates are the same. Regardless, you’re not going to be able to make OTUs and will need to use the phylotype approach described in the SOP. Here’s some information on the effects of high error rates:

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Pat

Thank you, Pat; that’s very helpful. (FYI, we are sequencing the V3 region, so the HiSeq may work…)

It would be great to see some mock community data from a HiSeq to see whether its error profile is similar to that of the MiSeq.

Pat

I only have some preliminary data here, but we had data from HiSeq 100 bp (EMP), HiSeq 150 bp (EMP) and in-house MiSeq (2 x 250 bp) runs. Unfortunately, the data came from a time series, so time point is confounded with sequencing run, which makes it hard to isolate sequencing effects. However, there were 5 time points (1 for 100 bp, 1 for 150 bp and 3 for 250 bp), and we included 2 samples from time 1 in the 250 bp run.

I simply used the forward reads in all, and only 100 bp.
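For anyone who wants to see what "forward reads only, first 100 bp" amounts to outside mothur, here is a minimal Python sketch. It is an illustration, not what mothur does internally: the function names `parse_fasta` and `truncate_to_common_length` are made up for this example, and dropping reads shorter than the target length is my assumption about how you would want to handle them.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def truncate_to_common_length(
    records: Iterable[Tuple[str, str]], length: int = 100
) -> List[Tuple[str, str]]:
    """Keep the first `length` bases of each read.

    Reads shorter than `length` are dropped (an assumption of this sketch),
    so every read that survives has exactly the same length.
    """
    kept = []
    for name, seq in records:
        if len(seq) >= length:
            kept.append((name, seq[:length]))
    return kept


if __name__ == "__main__":
    # Hypothetical file name; substitute your own forward-read FASTA.
    with open("forward_reads.fasta") as handle:
        for name, seq in truncate_to_common_length(parse_fasta(handle), 100):
            print(f">{name}\n{seq}")
```

The point is only that reads from the 100 bp, 150 bp and 250 bp runs end up covering the same positions before they are compared.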

A quick look at the data showed that the 3 sequencing runs clustered separately from each other. Although the MiSeq run covered 3 of the time points, those samples all clustered together, and the 2 samples from time 1 included in that run also clustered with them.

I lost all hope in analysing the data together.

Could be different in your case. Maybe if all the sequencing was prepared by the same center?

Good luck!

Wow, that… really sucks.

I just had a look at this data again. I actually had 3 other samples in the MiSeq run that could be compared with the HiSeq 2000 (1 x 150 bp), and they clustered together. So I guess the initial HiSeq 2000 (1 x 100 bp) run seemed the most dodgy.

Looking at my pipeline, I simply did the following:

Trim sequences

trim.seqs(fasta=ts.fasta, qfile=ts.qual, qwindowsize=5, qwindowaverage=25, minlength=75, keepfirst=105, maxambig=0, maxhomop=8, processors=8)

Extract the trimmed sequences from the groups file (no oligos file was used to generate the groups file, so this is the long way around)

list.seqs(fasta=ts.trim.fasta)
get.seqs(accnos=ts.trim.accnos, group=ts.groups)
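In case the logic of those two commands isn't obvious: the first collects the names of the sequences that survived trimming, and the second keeps only the groups-file lines for those names. A rough Python sketch of the same idea is below. It assumes the groups file has two whitespace-separated columns (sequence name, then group), and the function names `fasta_names` and `filter_groups` are made up for this illustration, not part of mothur.

```python
from typing import Iterable, List, Set, Tuple


def fasta_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from FASTA header lines (roughly list.seqs)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def filter_groups(
    group_lines: Iterable[str], keep: Set[str]
) -> List[Tuple[str, str]]:
    """Keep (name, group) rows whose name is in `keep` (roughly get.seqs)."""
    kept = []
    for line in group_lines:
        fields = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(fields) >= 2 and fields[0] in keep:
            kept.append((fields[0], fields[1]))
    return kept


if __name__ == "__main__":
    # File names follow the commands above.
    with open("ts.trim.fasta") as fasta:
        keep = fasta_names(fasta)
    with open("ts.groups") as groups:
        for name, group in filter_groups(groups, keep):
            print(f"{name}\t{group}")
```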

Have a go and good luck!