Total newbie, totally flummoxed

I have a set of 16S rRNA data that I have been trying to analyze for 2 months. After all my reading and futzing around with QIIME, mothur, uncle’s pants, aunt’s drawers and so many other programs, I cannot even figure out the very first step that I want to accomplish. I know somewhere on some web page someone has the perfect command, but geez, have I had trouble finding that one. OK, so here's the story.

Let's take my control sample. It came to me as 8 files (just one sample): forward and reverse reads, each split across 4 lanes, like this (Illumina next-generation sequencing):

Control_ssr_35427__R1__L001.fastq.gz
Control_ssr_35427__R1__L002.fastq.gz
Control_ssr_35427__R1__L003.fastq.gz
Control_ssr_35427__R1__L004.fastq.gz
Control_ssr_35427__R2__L001.fastq.gz
Control_ssr_35427__R2__L002.fastq.gz
Control_ssr_35427__R2__L003.fastq.gz
Control_ssr_35427__R2__L004.fastq.gz

Expanding any one .gz file gives me ONLY one fastq file: no mapping file, no barcodes, etc. I am presuming that the primers/barcodes have been removed and the data cleaned up for me??? Please assume I cannot get that info anyway. That’s another story.

  1. The overall question is: how do I merge all the files into one file? I know that many people suggest keeping the lanes and analyzing them separately, but I have trouble managing one file, so how do I manage 4 files even after "pairing" R1 & R2?

In QIIME: join_paired_ends.py or multiple_join… gives me 4 folders with another 3 files in each…so now I have 4x3=12 files. It pairs the reads without error, but the resulting files make even less sense to me. Is there not somewhere that explains any of this?

  2. Trying to merge lanes is hell on earth for me.

So I cannot even get started :frowning: :frowning: … Any help is appreciated.

I am sorry to hear you have had such a struggle, prem.

To partly answer your question, here is one way you could combine your files with mothur. Its make.contigs command (http://www.mothur.org/wiki/Make.contigs) will take a 3-column text file that you supply (format shown below) and do the following:

a. Combine the forward and reverse reads into contiguous reads, and do some error correction
b. Write all of the contiguous sequences to a single fasta-format file
c. Write a group file to keep track of which sequences came from where (http://www.mothur.org/wiki/Group_file)

Here is the format of the 3-column file you must supply for this:

Group_A Control_ssr_35427__R1__L001.fastq.gz Control_ssr_35427__R2__L001.fastq.gz
Group_B Control_ssr_35427__R1__L002.fastq.gz Control_ssr_35427__R2__L002.fastq.gz
Group_C Control_ssr_35427__R1__L003.fastq.gz Control_ssr_35427__R2__L003.fastq.gz
Group_D Control_ssr_35427__R1__L004.fastq.gz Control_ssr_35427__R2__L004.fastq.gz
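
With many lanes or samples, typing this file by hand gets error-prone. Here is a minimal sketch in Python that builds the rows from the file names, assuming the `__R1__L001` naming pattern seen in this thread. The function names and the per-lane group naming (`<sample>_L001`) are illustrative choices of mine, not anything mothur requires:

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern, copied from the file names in this thread:
#   <sample>__R1__L001.fastq.gz / <sample>__R2__L001.fastq.gz
LANE_FILE = re.compile(
    r"^(?P<sample>.+)__R(?P<read>[12])__L(?P<lane>\d{3})\.fastq\.gz$"
)


def build_files_table(names):
    """Return (group, forward_file, reverse_file) rows, one per sample and lane."""
    pairs = defaultdict(dict)
    for name in names:
        match = LANE_FILE.match(Path(name).name)
        if match is None:
            continue  # skip anything that does not follow the assumed pattern
        key = (match["sample"], match["lane"])
        pairs[key][match["read"]] = name

    rows = []
    for (sample, lane), reads in sorted(pairs.items()):
        if set(reads) != {"1", "2"}:
            raise ValueError(f"lane {lane} of {sample} is missing R1 or R2")
        # Hypothetical group naming: one group per lane of each sample.
        rows.append((f"{sample}_L{lane}", reads["1"], reads["2"]))
    return rows


def write_files_table(rows, out_path):
    """Write the rows as a space-separated 3-column text file."""
    with open(out_path, "w") as handle:
        for row in rows:
            handle.write(" ".join(row) + "\n")
```

Calling `write_files_table(build_files_table(names), "prems_samples.files")` on the eight file names above should give four rows, one per lane, in the same layout as the example.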


If you have barcodes or primers, you could also give make.contigs an oligos file (http://www.mothur.org/wiki/Make.contigs#oligos) to remove them.

Putting that all together, the function call would look something like this:

make.contigs(file=prems_samples.files, oligos=prems_oligos.oligos, processors=4)

Ooooh…thanks so much for your reply! :smiley:

Yes, I came across make.contigs in a tutorial, but I thought that was for separate samples.

OK, what if I do not know if barcodes/primers have been removed? What is my first step to check for that? Can I check without knowing what primers/barcodes were used?

Thanks.

PS: Is it better to manipulate lanes separately? Then combine the cleaned up data in the end…?

Did you use either Kozich or Caporaso primers? If so, the primers are not sequenced. The data is already demultiplexed and you are not getting your index fastq; all of the sequences within a file belong only to that sample.
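
If you are not sure which primers were used, one rough check is to look at how the reads begin. Here is a minimal sketch in Python, assuming standard 4-line FASTQ records. The function name, the 20-base prefix length and the 10,000-read sample size are my own choices for illustration, not part of mothur or QIIME:

```python
import gzip
from collections import Counter
from itertools import islice


def common_read_starts(fastq_gz, prefix_len=20, max_reads=10000):
    """Count the first `prefix_len` bases of up to `max_reads` reads."""
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        # Assumes 4 lines per record; the sequence is the 2nd line of each.
        sequence_lines = islice(handle, 1, 4 * max_reads, 4)
        for sequence in sequence_lines:
            counts[sequence.strip()[:prefix_len].upper()] += 1
    return counts
```

Something like `common_read_starts("Control_ssr_35427__R1__L001.fastq.gz").most_common(5)` lists the most frequent read starts. If one prefix accounts for most of the reads, it is worth comparing it against published primer sequences. Treat this as a hint rather than proof: the start of the amplified region can itself be conserved, and degenerate primers show up as several variants.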

I think that you can use a stability file, using the same name for all 4 lanes (Pat or Sarah will likely jump in if this is false info). If mothur doesn’t like multiple samples of the same name, you could run make.contigs and then merge.groups.
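
Another route to the one-file-per-direction setup asked about at the top of the thread is to join the lanes before running anything else. Gzip files can be appended end to end and the result is still a valid gzip file. Here is a minimal sketch in Python, assuming the file naming used in this thread; the function names and the `__merged` output name are hypothetical:

```python
import shutil
from pathlib import Path


def concatenate_lanes(lane_files, out_path):
    """Append gzip files byte for byte into one multi-member gzip file."""
    with open(out_path, "wb") as out:
        for lane_file in lane_files:
            with open(lane_file, "rb") as source:
                shutil.copyfileobj(source, out)


def merge_sample_lanes(sample, lanes, directory="."):
    """Merge R1 and R2 separately, using the same lane order for both.

    Keeping the lane order identical matters: read N in the merged R1 file
    must still be the mate of read N in the merged R2 file.
    """
    merged = {}
    for read in ("R1", "R2"):
        lane_files = [
            Path(directory) / f"{sample}__{read}__{lane}.fastq.gz"
            for lane in lanes
        ]
        out_path = Path(directory) / f"{sample}__{read}__merged.fastq.gz"
        concatenate_lanes(lane_files, out_path)
        merged[read] = out_path
    return merged
```

For the control sample this would be called as `merge_sample_lanes("Control_ssr_35427", ["L001", "L002", "L003", "L004"])`, leaving one merged R1 file and one merged R2 file to pair. The downside is that you lose the ability to compare lanes afterwards, so keep the originals.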