I have paired end Illumina Miseq data. So I am following the MiSeq SOP, which is really well done and easy to follow, by the way. In contrast to the SOP, my sequences are already assembled (by the sequencing company) in contigs, so I don’t have to do the fist step (make.contigs). As a consequence I however don’t know how to get the group file and the renaming of the fastq (converting : to _). Up to now I did it with make.group, I the merged the fastq to one fastq with merge.files. If I then don’t convert the fastq to a fasta I have problems when looking for unique sequences. If I however make a fasta the group names don’t fit the sequence names any more. I can avoid this by first converting all the fastq to fasta, before making the group. Up to now I did all of this with small batches but now I would like to work with all my samples which are more the 100. As far as I am concerned I can not make fastq.info for multiple files, so I would have to do that too many times, and also to make the groups I have to write all the sample names and group names by hand, which takes very long.
I am sure there is a super short and easy solution to my problem, using something corresponding to the stability.files in the SOP, but I just can’t find out how to do that without make.contigs..
If I understand this correctly, you have the groups file(s) built off the fastq files and now your problem is that when you convert the fastq to fasta mothur does a bit of sequence renaming and this mucks up the groups file?
The quickest solution I can think of is just to so a quick find/replace of the ‘:’ in your group file. If you’re on a *nix machine, then the console command
sed "s/:/_/g" old.groups > new.groups
should do the trick.
If not, mothur can be run through a bash script and if you have the original separate fastq files you could automate the fastq->fasta and group generation process. I’m not too sharp on bash, but I think this would work (you may need to play with the quotes a little bit - sorry, I can’t test this at the moment).
for f in *.fastq do
mothur "#fastq.info(fastq=$f); make.group(fasta=current, groups=$f, output=$f.groups)"
cat *.fasta > final.fasta
cat *.groups > final.groups
Which would run the fastq.info command on each fastq file in the folder and then perform the make.groups() on the output. This would leave you with the fasta, qual and groups file for each input. The individual fasta and groups files are then stuck together into your final files for the SOP.
Thank you I appreciate your effort very much (you even wrote a small script) .
The group file is however completely problematic, it is not just the _ missing. But I think that is because the : in the sequence names in the fastq is messing it up…
For Example grouping a fastq sequence :
HWI-M02024:81:000000000-AACUJ:1:1101:18936:3448 1:N:0:HWI-M02024:81:000000000-AACUJ:1:1101:18936:3448 1:N:0:TTGATCCG
TGGGGAATATTGG (…) into group A
will turn out like this:
If I make a fasta:
But that is probably just because of mothur expects a fasta, and I have a fastq…
I think your script might solve my problem, butI still find it hard to believe that there is nothing implemented to make a group files with multiple fastas without having to write all the fasta names and all the group names in Mothur. How would ppl for example do this who have single read Illumina data and not paired end? What I mean is that it would be great if there was an option where I could give Mothur a list with fasta names in col 1 and groups in col 2 and mothur would make the group file., like when you make contigs. If this doesn’t exist I will very happily use the bash script!
But could not go further even like you did, Eunde. I have a single full fasta file and a single full qual file, with all the sequences from all my 16 samples in them. Reads are assembled in contigs already, all in same direction and still with barcodes (according to what the sequencing service provider said)
How do I proceed to:
replace “:” by “_” in seq names in the fasta file?
create the groups file to start the MiSeq SOP from the screen.seqs command?
I am stuck here and cannot proceed with the analysis, any help will be more than welcome!!
Can you repost this to a new thread? This thread is 5+ years old.
If your reads are already assembled, but have the barcodes and primers on them still, then you’ll need to use trim.seqs. But honestly, you should get the truly raw (fastqs) from the sequence provider so that you can post them to the SRA when you are ready to publish.
I have the fastqs but with reads with barcode, both fwd and rev, in both R1 and R2 files and I don’t know how to use the make.contigs command.
I also have a full fasta file with the contigs to start at screen.seqs but I don’t know how to creat the groups file in that case.
I’d very much appreciate help to start following the SOP and process the sequences.