Getting data for NCBI submission

Hello!

I have a fasta file and a qfile from an illumina MiSeq run that has barcoded samples. I’ve finished all my analysis and been able to get mothur to work, but I’m trying to export in separate groups so I can submit it to the NCBI. I was able to break up my fasta file into all the separte communities I sequenced, but the NCBI told me they want either a fastq, or fasta/quality files. Is there anyway I can take one of my fasta files and have mothur look up the sequence name in the qfile and write it to a new qfile?

Thank you!
Eric

First up, since presumably you get your MiSeq data in fastq format to begin with can’t you just upload that, or the output of trim.seqs?

If you can’t for some reason…I don’t know if you can do this in mothur specifically but you can do this with a few console commands.

If you have your fasta file of interest, you could do this:

grep ">" final.fasta | cut -f2 -d ">" > seqs.list
grep -f seqs.list -A1 original.qual > final.qual

The first line just gets a list of the sequence names from the final fasta file. You could use sed instead of cut to remove the ‘>’ but I’ve had some bad experiences piping into sed so I tend to avoid it. The second list uses the list of sequence names as the search parameters for the second grep, which returns the lines the sequence names occur on, and the line that follows. This won’t work for any data, because there is a convention of writing sequences into 80 character lines, but mothur doesn’t do this so I think the above will work fine.

You can then stitch the files together into fastq using:

make.fastq(fasta=final.fasta, qfile=final.qual)

Which would be a good idea, because mothur will complain if anything has gone wrong (wrong number of sequences etc).

You can use mothur’s get.seqs and remove.seqs commands to select sequences from the files as well. http://www.mothur.org/wiki/Get.seqs http://www.mothur.org/wiki/Remove.seqs

mothur > list.seqs(fasta=final.fasta) - list sequences in the fasta file
mothur > get.seqs(qfile=final.qual, accnos=current) - select those sequences from the quality file
mothur > make.fastq(fasta=final.fasta, qfile=final.qual) - create a fastq file from fasta and quality files.

You may also be interested in the fastq.info command. It can also be used to separate your fastq files by sample using a oligos file or group file. http://www.mothur.org/wiki/Fastq.info

mothur > fastq.info(fastq=final.fastq, group=final.groups)

or

mothur > fastq.info(fastq=originalFile, oligos=yourOligos, pdiffs=2, bdiffs=1)

I used paired reads (MiSeq data) for the analysis which not totally overlap.
I started processing with make.contigs and removed primer and barcodes with trim.seqs. This worked very well.
But now I want to parse the original fastq files by sample to submit it to SRA (NCBI).
I tried the fastq.info command but it did not process my files.

fastq.info(file=parse.file, oligos=BD3.oligos, pdiffs=2, bdiffs=1, fasta=f, qfile=f)

The parse file has two columns, something like that : BD3_1_1.fastq BD3_1_2.fastq
If I used an oligo file with paired primers (which I also used for the trim.seqs command) all reads were put in the two scarp files.
The oligo file was as follows:
primer GATTACCGCGGCTGCTGG TTTGATCMTGGCTCAG
barcode TATTGAC AGAG C01f

Just to explain the data: The forward reads start with the barcode and forward primer and the other read with the other barcode and the reverse primer.

I also tried an oligo file containing only the barcode and forward primer. But in this case I got the error message: Could not open
Oligo file:
forward GATTACCGCGGCTGCTGG
#reverse AGAGTTTGATCMTGGCTCAG
barcode TATTGAC C01f

What am I doing wrong with the fastq.info?
Many thanks for your help
Andreas

I thought I have found a workaround by using an oligo file with replacing “none” for the reverse primer and barcode.
Oligo file:
primer GATTACCGCGGCTGCTGG none
barcode TATTGAC none C01f

It works but I noticed that some of my final sequences were not in the parsed fastq files.
Are there any reason for that?
If I use make.contigs and subsequently trim.seqs I get about 11% more sequences as compared to the fastq.info command.

Further, I noticed that barcodes and primer were not trimmed off from the parsed fastq files.

Is it possible to trim the fastq files during parsing the data set?
And could you please help with the right syntax for the fastq.info command?

Many thanks
Andreas

Could you post the make.contigs, trim.seqs, and fastq.info commands you are running?

The commands were
mothur > make.contigs(ffastq=BD3_1.fastq, rfastq=BD3_2.fastq, processors=32)
mothur > trim.seqs(fasta=current, oligos=BD3.oligos, pdiffs=2, bdiffs=1, checkorient=t, flip=T)
The number of sequences differed slightly between v.1.30.2 (which I used to process my data set) and v.1.34.4.
Using v.1.34.4 it resulted in 19981298 seqs.
The oligo file was as follows:
primer GATTACCGCGGCTGCTGG TTTGATCMTGGCTCAG
barcode TATTGAC AGAG C01f

mothur > fastq.info(file=parse.file, oligos=BD3-neu.oligos, pdiffs=2, bdiffs=1, fasta=f, qfile=f)
The fastq.info did not work with BD3.oligos (0 parsed sequences)
Therefore I changed it to:
primer GATTACCGCGGCTGCTGG none
barcode TATTGAC none C01f

I performed the fastq.info two times with different order of the two fastq files in the parse.file and merged the resulting fastq files.
(For sequencing of the amplicons, I used primer without the adapter which were ligated subsequently.)
So far I have not checked any group but e.g. C01f contained 596250 seqs. using trim.seqs and 528735 using fastq.info
Due to the removed reverse primer, I actually assumed some more sequences.
Maybe it is the wrong way to apply “none” for the reverse primer.

Thanks Andreas

Can you try fastq.info with the checkorient parameter set so it mirrors the trim.seqs command?

I already tried the checkorient parameter but it did not change the output.
I guess that the checkorient parameter looks for the primer/barcode in the complement sequence of the forward read but it does not check the other (reverse) read (of the second fastq file).
Do I have this right?

Thanks! This almost worked. When I use list.seqs I get an accnos file with 533 sequences. When I use get.seqs I get a quality file that has 532 sequences. Any idea why it’s dropping a sequence? And I don’t know if it matters, but the quality file is not in the same order as the fasta file. Is that going to cause issues when using make.fastq?

The order will cause issues with make.fastq, but mothur has a helpful command for that. http://www.mothur.org/wiki/Sort.seqs.

I have the same problem. Somehow mothur takes less sequences from quality file (using list.seqs and get.seqs) than fasta file has. And they can’t be used to make a fastq file.

Could you send your files to mothur.bugs@gmail.com?

I found that the problem in my case was inconsistency between fasta and quality sequences. How does mothur demultiplex in the trim.seqs command with an oligos file? When it finds a sequence with correct barcodes then does it find matching quality scores by the sequence header or it assumes that fasta and quality files are in the same order?

The trim.seqs command assumes the files are in the same order. You can use the sort.seqs command, http://www.mothur.org/wiki/Sort.seqs, to correct the order if needed.