Stumped on getting started

So I’m brand new to mothur and Illumina analysis in general, having just started my Master’s program. I’ve run through the MiSeq SOP a few times with no problem, but we received a data set recently from Mr.DNA and I’m at a loss as to where to start.

We were provided with the raw data as R1 and R2 .fastq files from a paired-end MiSeq run, and I was able to generate my own oligos files based on the mapping file that was also given. Whenever I run make.contigs with the two fastq files and my oligos file, everything gets put into scrap rather than trim. According to a post-doc who previously worked in this lab, "the R1 and R2 fastq files are not in the proper orientation (not all of the sequences present are in the forward direction). So, roughly half of the sequences in the R1 and R2 files are in the opposite orientation. Those that are will fail on the barcodes (since there is a barcode only on one strand)." I don’t know if there’s a workaround for that or not. I’ve tried generating fasta and qual files from the fastq files using fastq.info, but if I then run trim.seqs on those files using:

trim.seqs(fasta=horse_R1.fasta, qfile=horse_R1.qual, qaverage=25, flip=T, oligos=dowd.oligos, processors=8)

This also puts everything in scrap.
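One way to check the mixed-orientation explanation before going further is to count how many reads in each fastq file actually begin with a barcode. The following is a minimal Python sketch, not part of mothur: the barcodes, sample names, and reads are made up for illustration, and it only looks for exact matches at the very start of the read, whereas mothur tolerates mismatches via bdiffs.

```python
import io
from collections import Counter
from itertools import islice


def fastq_sequences(handle):
    """Yield the sequence line from each 4-line FASTQ record."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record[1].strip().upper()


def tally_barcode_starts(handle, barcodes):
    """Count reads that begin with each barcode (exact match only)."""
    counts = Counter()
    for seq in fastq_sequences(handle):
        label = "no_barcode"
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                label = sample
                break
        counts[label] += 1
    return counts


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGTACGT": "SampleA", "TTGGCCAA": "SampleB"}
demo_fastq = io.StringIO(
    "@read1\nACGTACGTGTGCCAGC\n+\nIIIIIIIIIIIIIIII\n"
    "@read2\nGGACTACAGGGTATCT\n+\nIIIIIIIIIIIIIIII\n"
    "@read3\nTTGGCCAAGTGCCAGC\n+\nIIIIIIIIIIIIIIII\n"
)
counts = tally_barcode_starts(demo_fastq, barcodes)
print(dict(counts))
```

To try it on real data, pass an open handle for the R1 file and then the R2 file, with the barcodes from the oligos file. If both files show a sizeable share of barcoded reads alongside a large no_barcode count, that would be consistent with the reads being in mixed orientation.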

If, instead of starting with the two fastq files, I start with 071814UD515F-full.fasta and 071814UD515F-full.qual, which come from the MR.DNA pipeline and are described as “These files contain the raw sequence data information and still have primers and barcodes”, I run into the same issue: everything is scrapped when I run the same trim.seqs command on them.

If I remove the primer sequences from my oligos file, the vast majority makes it into trim, with only the poor-quality sequences going into scrap. That still leaves the primers in my sequences, though. If I continue processing these files, I eventually run into errors when I do unique.seqs:

[ERROR]: You already have a sequence named M01522_177_000000000-AAWEC_1_1102_10047_3701 in your fasta file, sequence names must be unique, please correct.

and if I try to do count.seqs I get the error:

[ERROR]: processes reported processing 285854 sequences, but group file indicates you have 285892 sequences. Either you have a file mismatch or a process failed to complete the task assigned to it.
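Both errors point at the sequence names: one name appears twice in the fasta file, and the fasta and group files disagree on how many sequences there are. A minimal Python sketch for locating the offending names might look like the following. It is not a mothur feature; it assumes the usual two-column group file (sequence name, then group name) and uses made-up names for the demonstration.

```python
from collections import Counter


def fasta_names(lines):
    """Return sequence names (text after '>' up to the first whitespace)."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def group_names(lines):
    """Return sequence names from a group file (name, whitespace, group)."""
    return [line.split()[0] for line in lines if line.strip()]


def compare_names(fasta_lines, group_lines):
    """Report duplicated fasta names and names present in only one file."""
    in_fasta = fasta_names(fasta_lines)
    in_group = group_names(group_lines)
    duplicates = sorted(n for n, c in Counter(in_fasta).items() if c > 1)
    return {
        "duplicate_in_fasta": duplicates,
        "only_in_fasta": sorted(set(in_fasta) - set(in_group)),
        "only_in_group": sorted(set(in_group) - set(in_fasta)),
    }


# Made-up names, for illustration only.
demo_fasta = [">seq1 extra", "ACGT", ">seq2", "ACGT", ">seq1", "ACGT", ">seq4", "ACGT"]
demo_group = ["seq1\tA", "seq2\tA", "seq3\tB"]
report = compare_names(demo_fasta, demo_group)
print(report)
```

On real files, `open(...)` handles can be passed in place of the demo lists, since both helpers only iterate over lines.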

I’ve uploaded all the files in question to Google Drive and would gladly share them with anyone who thinks they can help.

https://drive.google.com/folderview?id=0B_h5mHtOUQsISFNRTndwNE5QbkU&usp=sharing

I think once I get over this initial hurdle I shouldn’t have too many problems, but right now I can’t even get started on the analysis.

Thanks,
Tyler

… we received a data set recently from Mr.DNA and I’m at a loss as to where to start

You poor soul. I would actually be interested in seeing any material they sent you. You guys aren’t the first to struggle with MrDNA-generated data. Here’s what I’ve done…

make.contigs(ffastq=SAMPS1-3_S7_L001_R1_001.fastq, rfastq=SAMPS1-3_S7_L001_R2_001.fastq, processors=8)
trim.seqs(fasta=SAMPS1-3_S7_L001_R1_001.trim.contigs.fasta, oligos=horse.oligos, checkorient=T, pdiffs=2, bdiffs=1)
summary.seqs()

Here’s the number of sequences per group out of trim.seqs:
Brown.Dolly 5603
Swab.Dolly 121542
White.Dolly 1984
Total of all groups is 129129

There were another 278953 sequences that went to the scrap heap. I think there might be a few small problems in your oligos file. For example, when I changed the line:

barcode NAGTCTGT Swab.Dolly

to

barcode NGAGTCTGT Swab.Dolly

the distribution changed to:
Brown.Dolly 5630
Swab.Dolly 121582
White.Dolly 24295
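A quick way to spot this kind of oligos-file mismatch is to tally the most common leading bases in the contigs and compare them against the barcode lines. Below is a minimal Python sketch of that idea, separate from mothur itself. The reads in the demonstration are made up, and the prefix length k is an assumption that should be set to the barcode length being checked.

```python
from collections import Counter


def iter_fasta_seqs(lines):
    """Yield each sequence from FASTA lines, joining wrapped sequence lines."""
    parts = []
    started = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if started:
                yield "".join(parts)
            parts = []
            started = True
        elif line and started:
            parts.append(line.upper())
    if started:
        yield "".join(parts)


def prefix_counts(lines, k):
    """Count the first k bases of every sequence that is at least k long."""
    return Counter(seq[:k] for seq in iter_fasta_seqs(lines) if len(seq) >= k)


# Made-up reads, for illustration only.
demo = [">r1", "GAGTCTGTACGT", ">r2", "GAGTCTGTTTTT", ">r3", "CCCCCCCCAAAA"]
top = prefix_counts(demo, 8).most_common(1)
print(top)
```

Prefixes that are very common in the contigs but absent from the oligos file are candidates for a mistyped or shifted barcode.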

Hope this helps,
Pat