Getting data for NCBI submission

estaunton · January 14, 2015, 5:23pm

Hello!

I have a fasta file and a qfile from an illumina MiSeq run that has barcoded samples. I’ve finished all my analysis and been able to get mothur to work, but I’m trying to export in separate groups so I can submit it to the NCBI. I was able to break up my fasta file into all the separte communities I sequenced, but the NCBI told me they want either a fastq, or fasta/quality files. Is there anyway I can take one of my fasta files and have mothur look up the sequence name in the qfile and write it to a new qfile?

Thank you!
Eric

dwaite · January 14, 2015, 8:03pm

First up, since presumably you get your MiSeq data in fastq format to begin with can’t you just upload that, or the output of trim.seqs?

If you can’t for some reason…I don’t know if you can do this in mothur specifically but you can do this with a few console commands.

If you have your fasta file of interest, you could do this:

grep ">" final.fasta | cut -f2 -d ">" > seqs.list
grep -f seqs.list -A1 original.qual > final.qual

The first line just gets a list of the sequence names from the final fasta file. You could use sed instead of cut to remove the ‘>’ but I’ve had some bad experiences piping into sed so I tend to avoid it. The second list uses the list of sequence names as the search parameters for the second grep, which returns the lines the sequence names occur on, and the line that follows. This won’t work for any data, because there is a convention of writing sequences into 80 character lines, but mothur doesn’t do this so I think the above will work fine.

You can then stitch the files together into fastq using:

make.fastq(fasta=final.fasta, qfile=final.qual)

Which would be a good idea, because mothur will complain if anything has gone wrong (wrong number of sequences etc).

westcott · January 15, 2015, 2:04pm

You can use mothur’s get.seqs and remove.seqs commands to select sequences from the files as well. http://www.mothur.org/wiki/Get.seqs http://www.mothur.org/wiki/Remove.seqs

mothur > list.seqs(fasta=final.fasta) - list sequences in the fasta file
mothur > get.seqs(qfile=final.qual, accnos=current) - select those sequences from the quality file
mothur > make.fastq(fasta=final.fasta, qfile=final.qual) - create a fastq file from fasta and quality files.

You may also be interested in the fastq.info command. It can also be used to separate your fastq files by sample using a oligos file or group file. http://www.mothur.org/wiki/Fastq.info

mothur > fastq.info(fastq=final.fastq, group=final.groups)

or

mothur > fastq.info(fastq=originalFile, oligos=yourOligos, pdiffs=2, bdiffs=1)

andreas · February 6, 2015, 1:50pm

I used paired reads (MiSeq data) for the analysis which not totally overlap.
I started processing with make.contigs and removed primer and barcodes with trim.seqs. This worked very well.
But now I want to parse the original fastq files by sample to submit it to SRA (NCBI).
I tried the fastq.info command but it did not process my files.

fastq.info(file=parse.file, oligos=BD3.oligos, pdiffs=2, bdiffs=1, fasta=f, qfile=f)

The parse file has two columns, something like that : BD3_1_1.fastq BD3_1_2.fastq
If I used an oligo file with paired primers (which I also used for the trim.seqs command) all reads were put in the two scarp files.
The oligo file was as follows:
primer GATTACCGCGGCTGCTGG TTTGATCMTGGCTCAG
barcode TATTGAC AGAG C01f
…
Just to explain the data: The forward reads start with the barcode and forward primer and the other read with the other barcode and the reverse primer.

I also tried an oligo file containing only the barcode and forward primer. But in this case I got the error message: Could not open
Oligo file:
forward GATTACCGCGGCTGCTGG
#reverse AGAGTTTGATCMTGGCTCAG
barcode TATTGAC C01f
…
What am I doing wrong with the fastq.info?
Many thanks for your help
Andreas

andreas · February 9, 2015, 2:03pm

I thought I have found a workaround by using an oligo file with replacing “none” for the reverse primer and barcode.
Oligo file:
primer GATTACCGCGGCTGCTGG none
barcode TATTGAC none C01f
…
It works but I noticed that some of my final sequences were not in the parsed fastq files.
Are there any reason for that?
If I use make.contigs and subsequently trim.seqs I get about 11% more sequences as compared to the fastq.info command.

Further, I noticed that barcodes and primer were not trimmed off from the parsed fastq files.

Is it possible to trim the fastq files during parsing the data set?
And could you please help with the right syntax for the fastq.info command?

Many thanks
Andreas

westcott · February 9, 2015, 4:54pm

Could you post the make.contigs, trim.seqs, and fastq.info commands you are running?

andreas · February 10, 2015, 9:39am

The commands were
mothur > make.contigs(ffastq=BD3_1.fastq, rfastq=BD3_2.fastq, processors=32)
mothur > trim.seqs(fasta=current, oligos=BD3.oligos, pdiffs=2, bdiffs=1, checkorient=t, flip=T)
The number of sequences differed slightly between v.1.30.2 (which I used to process my data set) and v.1.34.4.
Using v.1.34.4 it resulted in 19981298 seqs.
The oligo file was as follows:
primer GATTACCGCGGCTGCTGG TTTGATCMTGGCTCAG
barcode TATTGAC AGAG C01f
…

mothur > fastq.info(file=parse.file, oligos=BD3-neu.oligos, pdiffs=2, bdiffs=1, fasta=f, qfile=f)
The fastq.info did not work with BD3.oligos (0 parsed sequences)
Therefore I changed it to:
primer GATTACCGCGGCTGCTGG none
barcode TATTGAC none C01f
…
I performed the fastq.info two times with different order of the two fastq files in the parse.file and merged the resulting fastq files.
(For sequencing of the amplicons, I used primer without the adapter which were ligated subsequently.)
So far I have not checked any group but e.g. C01f contained 596250 seqs. using trim.seqs and 528735 using fastq.info
Due to the removed reverse primer, I actually assumed some more sequences.
Maybe it is the wrong way to apply “none” for the reverse primer.

Thanks Andreas

westcott · February 10, 2015, 2:19pm

Can you try fastq.info with the checkorient parameter set so it mirrors the trim.seqs command?

andreas · February 10, 2015, 2:51pm

I already tried the checkorient parameter but it did not change the output.
I guess that the checkorient parameter looks for the primer/barcode in the complement sequence of the forward read but it does not check the other (reverse) read (of the second fastq file).
Do I have this right?

estaunton · February 17, 2015, 2:00pm

Thanks! This almost worked. When I use list.seqs I get an accnos file with 533 sequences. When I use get.seqs I get a quality file that has 532 sequences. Any idea why it’s dropping a sequence? And I don’t know if it matters, but the quality file is not in the same order as the fasta file. Is that going to cause issues when using make.fastq?

westcott · February 18, 2015, 2:41pm

The order will cause issues with make.fastq, but mothur has a helpful command for that. http://www.mothur.org/wiki/Sort.seqs.

Chris · March 12, 2015, 11:51am

I have the same problem. Somehow mothur takes less sequences from quality file (using list.seqs and get.seqs) than fasta file has. And they can’t be used to make a fastq file.

westcott · March 12, 2015, 3:36pm

Could you send your files to mothur.bugs@gmail.com?

Chris · March 17, 2015, 8:09am

I found that the problem in my case was inconsistency between fasta and quality sequences. How does mothur demultiplex in the trim.seqs command with an oligos file? When it finds a sequence with correct barcodes then does it find matching quality scores by the sequence header or it assumes that fasta and quality files are in the same order?

westcott · March 17, 2015, 2:05pm

The trim.seqs command assumes the files are in the same order. You can use the sort.seqs command, http://www.mothur.org/wiki/Sort.seqs, to correct the order if needed.

Topic		Replies	Views
Can ''qwindowaverage'' help me ? Commands in mothur	3	2577	September 17, 2014
Help please with a command o mothur Theory behind mothur	8	590	March 7, 2021
summary.seqs Commands in mothur	1	2173	June 20, 2014
fastq, fasta and qual Commands in mothur	2	781	December 15, 2017
Trim.seqs removing all seqs (processing single-end reads) Commands in mothur	6	1300	August 28, 2020

Getting data for NCBI submission

Related topics