make.contigs when dual index is in fastq header

Hi all - we’re migrating over from using QIIME to Mothur for our analysis of the 16S V4 region, and have encountered a problem using the make.contigs function in conjunction with an .oligos file to produce a .group file. I’ve read several of the posts on this topic, such as here, here, and here, but have not found a fix.

We’re running MiSeq V2 PE250, and are using the Schloss dual indexing primers. We always receive our data from the local sequencing core in the form of two fastq files, one for read1 and the second for read2. Looking at the head of one of these files shows that the dual index is actually listed in the header for each sequence (the last field of the header line in the example below). The reverse complement of the i7 is listed first, followed by the i5. Is there a way to have make.contigs look in the header for these indices? How would I format the .oligos file given that the dual index is listed for both read1 and read2?

Every version of an oligos file I try results in everything going to the scrap.contigs file. If I don’t specify any oligos file, I get good contigs and can successfully run align.seqs. Of course, then I have no .group file to use downstream.

Thanks in advance for any help on this issue!

Best,

Dan

@HWI-M00590:260:000000000-AGRFM:1:1101:15605:1332 1:N:0:TAGTCTCCGGATATCT
TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGTAGGCGGACTTTTAAGTGGGATGTGAAATACTCGGGCTCAACTTGAGTGCTGCATTTCAAACTGGAAGTCTAGAGTGCAGGAGAGGAGAATGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAAGAACACCAGTGGCGAAGGCGATTCTCTGGACTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAAC
+
AAA11FF@1BAB1EEEEEG0FGHEGG?EHHHHHHHHGGGGHHGGHCEEFGGGHGG@EGFFHHEHFB1GHHGHGBFFG1BGHFEGEEHFFHHHH1FH1FGHHGHHHBFGHHGHH0@FFFGFHHHHEHHHAEGHGG0C.0><CGFFHGHE1GDDDDHHCGEGGC0<GBGG.ACG0;G0CF;0;9CFGGG.;FGG?A-99CE@-@EFFFFFFFFFEB/;FBFFFBFFF-FE--F@-9-9FBFF@@@-@#####

If your samples are already demultiplexed (R1 and R2 fastq for each sample), you don’t need an oligos file, just a stability file.

F3D0 F3D0_S188_L001_R1_001.fastq F3D0_S188_L001_R2_001.fastq
F3D141 F3D141_S207_L001_R1_001.fastq F3D141_S207_L001_R2_001.fastq
F3D142 F3D142_S208_L001_R1_001.fastq F3D142_S208_L001_R2_001.fastq
F3D143 F3D143_S209_L001_R1_001.fastq F3D143_S209_L001_R2_001.fastq
F3D144 F3D144_S210_L001_R1_001.fastq F3D144_S210_L001_R2_001.fastq
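
For readers with many demultiplexed samples, here is a minimal Python sketch of building the three-column listing above from a list of file names. It assumes Illumina-style names of the form `<sample>_S<n>_L<lane>_R<1|2>_001.fastq`, as in the listing; the function name `build_file_lines` is made up for illustration and is not part of mothur.

```python
import re
from collections import defaultdict


def build_file_lines(fastq_names):
    """Pair R1/R2 file names by sample prefix and return 'sample R1 R2' lines."""
    pairs = defaultdict(dict)
    for name in fastq_names:
        # Assumption: names look like <sample>_S<n>_L<lane>_R<1|2>_001.fastq
        match = re.match(r"(.+?)_S\d+_L\d+_R([12])_001\.fastq$", name)
        if match:
            pairs[match.group(1)][match.group(2)] = name
    lines = []
    for sample in sorted(pairs):
        reads = pairs[sample]
        # Only emit samples that have both a forward and a reverse file.
        if "1" in reads and "2" in reads:
            lines.append(f"{sample} {reads['1']} {reads['2']}")
    return lines


names = [
    "F3D0_S188_L001_R1_001.fastq",
    "F3D0_S188_L001_R2_001.fastq",
    "F3D141_S207_L001_R2_001.fastq",
    "F3D141_S207_L001_R1_001.fastq",
]
print("\n".join(build_file_lines(names)))
```

Samples missing either read file are skipped silently in this sketch; a real script would probably want to report them.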

If your samples are already demultiplexed (R1 and R2 fastq for each sample), you don’t need an oligos file just a stability file.

That’s just it, they’re not demultiplexed. There is only a single R1 and a single R2 fastq, but they contain close to 300 dual-indexed samples, with the dual index embedded as part of the header. So it’s not clear to me how one would go about accessing this barcode to demultiplex in Mothur.

In QIIME, this is accomplished using the extract_barcodes.py function with the option “barcode_in_label” specified. Not sure if there is a similar function in Mothur. Perhaps this is an uncommon way for barcodes to be stored in the fastq…anyone else have this issue?

Best,

Dan
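
For anyone with the same header layout, here is a minimal Python sketch of pulling the index pair out of a header line. It assumes the index is the final colon-separated field and that each index is 8 nt, as in the example record earlier in the thread. The function name `split_header_index` is made up for illustration; it is not a mothur or QIIME function.

```python
def split_header_index(header, index_len=8):
    """Return (first_index, second_index) from the last ':' field of a FASTQ header."""
    # Assumption: the concatenated dual index is everything after the final colon.
    tag = header.strip().rsplit(":", 1)[-1]
    if len(tag) != 2 * index_len:
        raise ValueError(f"expected {2 * index_len} index bases, found {len(tag)}")
    return tag[:index_len], tag[index_len:]


header = "@HWI-M00590:260:000000000-AGRFM:1:1101:15605:1332 1:N:0:TAGTCTCCGGATATCT"
first_index, second_index = split_header_index(header)
print(first_index, second_index)
```

For the record shown earlier, the first value is the reverse complement of the i7 and the second is the i5, following the order described in the original post.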

Dan,

Welcome to the family :).

Have you tried reaching out to the sequencing provider and asking them to split the fastq files for you? This really shouldn’t be hard for them to do. If they have problems, feel free to email us and we can show them what we do. Otherwise, I’m afraid you’re going to have to write a script to split the files by the paired indices.

Pat
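
As a rough illustration of the kind of splitting script mentioned above, here is a minimal Python sketch that groups read pairs by the index pair found in the read1 header. It assumes four-line FASTQ records, reads in the same order in both files, an 8 nt + 8 nt index in the last colon-separated header field, and exact index matches only. The function names, the toy reads, and the sample name `sampleA` are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def demultiplex(r1_handle, r2_handle, barcode_to_sample, index_len=8):
    """Group read pairs by sample, using the index pair in the read1 header."""
    groups = {}
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        # Assumption: the dual index is the text after the final colon.
        tag = rec1[0].rsplit(":", 1)[-1]
        key = (tag[:index_len], tag[index_len:])
        # Exact matches only; anything else is set aside.
        sample = barcode_to_sample.get(key, "undetermined")
        groups.setdefault(sample, []).append((rec1, rec2))
    return groups


# Toy data (hypothetical reads; only the header layout follows the thread).
r1 = io.StringIO(
    "@read1 1:N:0:TAGTCTCCGGATATCT\nACGT\n+\nIIII\n"
    "@read2 1:N:0:AAAAAAAACCCCCCCC\nTTTT\n+\nIIII\n"
)
r2 = io.StringIO(
    "@read1 2:N:0:TAGTCTCCGGATATCT\nTGCA\n+\nIIII\n"
    "@read2 2:N:0:AAAAAAAACCCCCCCC\nGGGG\n+\nIIII\n"
)
groups = demultiplex(r1, r2, {("TAGTCTCC", "GGATATCT"): "sampleA"})
for sample, pairs in sorted(groups.items()):
    print(sample, len(pairs))
```

In practice each group would be written to its own pair of fastq files and listed in a file like the one shown earlier in the thread.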

Pat - Thanks for the warm welcome. I’m talking with our seq core now to have them do the demultiplexing. They’re happy to do this for us, so I think I’ll be over this small hurdle soon.

Best,

Dan

Hi all - just wanted to post an update on where I’m at with this issue, and also seek some more feedback.

Based on Pat’s suggestion, I went back to our sequencing core and asked them to demultiplex the files. Every time they tried this, a significant amount of data ended up in ‘undetermined.fastq’. Looking at the reads in this file, it was clear there were certain barcodes that always ended up here. After a bit of back and forth, they realized there was a low-level error in their demultiplexing script that caused a barcode conflict when allowing for up to one mismatch. So, they have gone back to giving me one file for all forward reads, and one for reverse. However, they have removed the barcode from the header (the initial issue that prompted me to open this thread), and instead have left them in the read itself. Now the files look like this:

Read1.fastq
@HWI-M00590:260:000000000-AGRFM:1:1101:15605:1332 1:N:0:
TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGTAGGCGGACTTTTAAGTGGGATGTGAAATACTCGGGCTCAACTTGAGTGCTGCATTTCAAACTGGAAGTCTAGAGTGCAGGAGAGGAGAATGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAAGAACACCAGTGGCGAAGGCGATTCTCTGGACTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACATAGTCTCC (the trailing TAGTCTCC is the reverse complement of the i7 index)


Read2.fastq
@HWI-M00590:260:000000000-AGRFM:1:1101:15605:1332 2:N:0:
GGATATCTCCTGTTTGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGTCCAGAGAGTCGCCTTCGCCACTGGTGTTCTTCCTAATCTCTACGCATTTCACCGCTACACTCGGAATTCCATTCTCCTCTCCTTCACTCTCGACTTCCAGTTTGAAATGCTGCACTCAAGTTGAGCCCGCGGATTTCCTATCCCACTTAAAAGTCCGCCGACGCTCGCTTTACGCAGCGTACATTCGGCGCAAGATTACAAGCCTGC (the leading GGATATCT is the i5 index)

Despite these changes to my input .fastq files, I am still unable to successfully demux (everything goes to scrap) when I use the following .oligos file (only the head of the file is shown), where the second column is the reverse complement of i7 and the third column is i5:

barcode TAGCAGCT GATCGTGT 10638
barcode TCTCTATG GATCGTGT 10639
barcode GTAACGAG GATCGTGT 10641
barcode ACGTGCGC GATCGTGT 10642
barcode AACGCTGA CGTTACTA 10575

Any additional comments would be greatly appreciated. Happy holidays everyone!
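
On the barcode conflict mentioned above: when up to one mismatch is allowed, two barcodes that differ at two or fewer positions can both claim the same read. Here is a minimal Python sketch that lists such pairs, using the i7 column from the head of the .oligos file. This is a generic Hamming-distance check, not necessarily the error the sequencing core found, and the function names are made up for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def ambiguous_pairs(barcodes, max_mismatch=1):
    """Return (a, b, distance) for barcode pairs that can collide.

    A read within max_mismatch of one barcode can also be within
    max_mismatch of another when the two barcodes differ at no more
    than 2 * max_mismatch positions.
    """
    limit = 2 * max_mismatch
    unique = sorted(set(barcodes))
    return [
        (a, b, hamming(a, b))
        for a, b in combinations(unique, 2)
        if hamming(a, b) <= limit
    ]


# Second column (reverse complement of i7) from the .oligos head above.
i7_rc = ["TAGCAGCT", "TCTCTATG", "GTAACGAG", "ACGTGCGC", "AACGCTGA"]
print(ambiguous_pairs(i7_rc))
```

An empty list means no pair in the input is close enough to collide at the chosen mismatch setting; the same check can be run on the i5 column.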

Hey Dan,

If you’re using the method outlined in the Kozich paper, you won’t have the index sequences or the primers on the sequences. The method actually generates four files - two for the index reads and two for the sequence reads.

If these reads are examples of how you get your data back from the sequencing provider, they’re doing something screwy. They seem to have pasted your index to the end and beginning of your reads. Even if they were using our primers and then sequencing off of the adapters (not recommended), it should go barcode, pad, link, primer. Also, the reads should only be 251 nt, not 259. I suspect they may have created a hack by concatenating the index sequence to the end and beginning of the reads. This probably would have worked if they had concatenated the index sequences to the beginning of both reads.

Regardless, I would be a bit worried about their inability to demux the samples on the machine. As far as I can tell, this is done on the machine and not with some custom scripts. If they need help setting this up, have them feel free to send me an email at my umich.edu account at pschloss.

Pat

Hi Pat - thanks again for helping with this. I got this issue resolved with our sequencing core. They now provide me with the barcodes in two separate .fastq files, and this works with make.contigs using the findex/rindex arguments.

Best,

Dan