I’m about to perform my very first Mothur 16S analysis. After reading through a whole bunch of good advice on the wiki and on this forum, working out which commands I want to use, testing them, and constructing an entire pipeline for my data, I’m left with one very basic question about merging files.
This is the case: I have 7 metagenomic samples (a geological survey at different sites), from which a 16S region was sequenced by 454. As output I received 7 fastq files in which I can only find the forward primers, but not the barcodes. According to the company, the barcodes were indeed removed, so this seems to be correct (but they also said the primer sequences were removed, which they clearly weren’t). So my oligos file for trimming only contains a forward primer and that’s it. I also have sff files, but I’m not sure how to use them. I read about shhh.flows etc., but have to confess I don’t completely understand what the rationale is behind using these and not the fastas.
Anyway, my question:
Since I want to compare my 7 samples for a common ‘backbone’ versus different dominant groups, similarities etc., I believe it would be best to process the samples together, to get common OTUs. So my thinking was to first trim the individual fastas thoroughly and then merge them to proceed with alignment, clustering and such. But my question is: how does this merging work? Will I be able to distinguish afterwards which sequence came from which sample? Since I don’t have barcodes, I’m unsure about the identification of my sequences after merging. Or is there a command to add a tag to my sequences during trimming? Maybe I overlooked it, but I didn’t find anything like that. But maybe it is not necessary and merge.files does this automatically? Or would this be exactly the reason to use sff files?
Thanks for any help on these basic questions! I just don’t feel like running my entire pipeline and ending up with useless info…
I think it would be hard to accomplish the first steps of this in mothur.
As I understand it, you have 7 separate files for 7 separate samples, and after you have trimmed them you want to combine them into one file.
If you have combined them all into one fasta file and made a group file (which lists, for every sequence name in the fasta, the sample it came from), you should be able to follow the mothur SOP without much trouble.
I don’t know how much experience you have with UNIX-like operating systems and scripting. Combining the 7 fasta files can be done with just the cat command in Unix/Linux. The bigger problem is that before that you should append some kind of identifier (for the groups) to the fasta names, and that you have to make a group file of your own (from said identifiers); these require some scripting.
(Logical scripting steps: take each of your fastas and add _group1 etc. to the end of each > line, append them all together into one file, then use a second script that takes each > line and writes it to another file as name, tab, group1 etc.)
If you can script Perl or Python or something yourself, or know somebody who can, it will not be too big a deal.
I had to do something like that half a year ago. If I can find them and you want them, I can post samples of my scripts.
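A minimal Python sketch of the scripting steps described above, for anyone without scripts of their own. It is not the script referred to in this post. The function name tag_and_merge and the group labels (site1, site2, ...) are made up for illustration, and it assumes plain FASTA files in which the first word of each > line is the read name.

```python
def tag_and_merge(samples, merged_fasta, group_file):
    """Tag every read name with its sample, merge the reads, write a group file.

    samples maps a group label (hypothetical, e.g. "site1") to a FASTA path.
    Returns the number of sequences written per group.
    """
    counts = {}
    with open(merged_fasta, "w") as fasta_out, open(group_file, "w") as group_out:
        for group, path in samples.items():
            counts[group] = 0
            with open(path) as fasta_in:
                for line in fasta_in:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Keep only the read name (first word) and append the group tag.
                        name = f"{line[1:].split()[0]}_{group}"
                        fasta_out.write(f">{name}\n")
                        # Group file: sequence name, tab, group label.
                        group_out.write(f"{name}\t{group}\n")
                        counts[group] += 1
                    else:
                        fasta_out.write(line + "\n")
    return counts
```

Calling tag_and_merge({"site1": "sample1.fasta", "site2": "sample2.fasta"}, "all.fasta", "all.groups") would write both files in one pass, so no separate cat step is needed. The file names here are placeholders.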
Well, I have very, very limited experience with scripting, and the only colleague I have to help me with it is severely overloaded.
BUT, I did some digging on the use of sff instead of fastq. Since I have them, I wanted to know what is in there. And it turns out that those fastas DO contain the barcodes, although the company told us they would be exactly the same as the fastas from the fastq… So using the fastas extracted from the sff kind of solves my problem, since I don’t have to add my own identifiers to the sequences anymore. I will merge them first and then trim for barcodes etc., making the group files.
Now I’m looking into using shhh.flows instead of trimming by qual files, as I have the impression from the SOP this should be a lot better. Correct me if I’m wrong, but it seems that the best pipeline for me would be:
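One way to check whether a set of reads still carries barcodes is to look at what sits in front of the forward primer in each read. The Python sketch below is only an illustration of that check, not part of mothur: the function names are hypothetical, the primer is whatever is in your own oligos file, and the max_prefix cutoff of 20 bases is an arbitrary assumption.

```python
import re
from collections import Counter

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

NO_PRIMER = "(no primer found)"


def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def prefix_counts(sequences, primer, max_prefix=20):
    """Count the bases found in front of the first primer match in each read.

    An empty-string key means the read starts directly with the primer.
    Reads without a match within max_prefix bases are counted under NO_PRIMER.
    """
    pattern = re.compile("".join(IUPAC[base] for base in primer.upper()))
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        match = pattern.search(seq)
        if match is None or match.start() > max_prefix:
            counts[NO_PRIMER] += 1
        else:
            counts[seq[:match.start()]] += 1
    return counts
```

For a file extracted from an sff, prefix_counts((seq for _, seq in read_fasta("sample1.fasta")), primer).most_common(10) would list the most frequent prefixes. If one short prefix dominates, that is probably the barcode; if the empty prefix dominates, the barcodes were stripped.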
sffinfo, for each of my 7 files separately
trim.flows, idem, since I have 7 separate .flow files
shhh.flows on the .trim.flow, again 7 times (it doesn’t take that long, fortunately)
merge.files to put the 7 shhh.fastas in one fasta
trim.seqs to cull primers and barcodes, creating a group file and doing some more cleaning up
continue on the SOP
I do wonder how/if I should merge the shhh.names for each of the seven samples, but I guess I will do some more digging myself and I will find an answer
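On the names-file question: a names file has two columns per line, a representative read name followed by a comma-separated list of all the reads it stands for. As long as read names are unique across samples, merging should amount to concatenation. The Python sketch below illustrates that idea with a duplicate check added. It is a stand-in written for this explanation, not how mothur implements merge.files, and merge_names is a hypothetical name.

```python
def merge_names(paths, merged_path):
    """Concatenate names files, refusing read names that occur more than once.

    Each input line is expected to hold two whitespace-separated columns:
    representative name, then a comma-separated list of member names.
    Returns the total number of reads covered by the merged file.
    """
    seen = set()
    total = 0
    with open(merged_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    fields = line.split()
                    if not fields:
                        continue  # skip blank lines
                    if len(fields) != 2:
                        raise ValueError(f"unexpected line in {path}: {line!r}")
                    representative, members = fields
                    names = members.split(",")
                    duplicates = seen.intersection(names)
                    if duplicates:
                        raise ValueError(f"read names reused in {path}: {sorted(duplicates)}")
                    seen.update(names)
                    out.write(f"{representative}\t{members}\n")
                    total += len(names)
    return total
```

A ValueError here would mean two samples share a read name, in which case the merged names and fasta files would be ambiguous downstream.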
So, Jenz, thanks a lot for your very fast reply! I guess you even replied too fast, since I was just going to let everyone know I found an answer myself (I think). So sorry for wasting your time! I hope this post can be of help for some other Mothur-starter like myself in the future…
So you can definitely do what you want in mothur. A couple of things…
As output I received 7 fastq files in which I can only find the forward primers, but not the barcodes. According to the company, the barcodes were indeed removed, so this seems to be correct (but they also said the primer sequences were removed, which they clearly weren’t).
This is pretty frustrating for the users. I’m not sure why the sequencing providers are doing this. You really need to get ahold of the sff files because these (1) will have the barcodes and (2) will provide you with the flowgram data. Going the shhh.flows approach through the SOP will get you more and longer sequences than using the quality score approach. Who is doing your sequencing? If they won’t adapt or are giving you grief, I’d take your money elsewhere.
The pipeline you posted is mostly right. Here’s what I would do for a fail-safe approach…
sffinfo, for each of my 7 files separately
trim.flows, idem, since I have 7 separate .flow files
shhh.flows on the .trim.flow, again 7 times (it doesn’t take that long, fortunately)
trim.seqs on the 7 shhh.fasta files
merge.files to put the 7 shhh.trim.fastas in one fasta
Notice the last two lines are flipped. My worry would be that if the 7 files were sequenced on separate runs they may use the same barcode. If the same barcodes aren’t re-used then you could probably run merge.files right after sffinfo. To pull this off you’ll definitely need the barcode sequences. Also, merge.files will merge name, group, fasta, and flow files.
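Whether barcodes were re-used across runs can be checked before deciding on the order of merging and trimming. The Python sketch below assumes one oligos file per sff file, with barcode lines in the usual three-column form (the word barcode, the sequence, the group name). The function names are hypothetical, and paired-barcode layouts are not handled.

```python
from collections import defaultdict


def read_barcodes(path):
    """Return {barcode sequence: group name} from the barcode lines of an oligos file."""
    barcodes = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Only lines whose first word is "barcode" are used; primer lines
            # and commented-out lines are ignored.
            if len(fields) >= 3 and fields[0].lower() == "barcode":
                barcodes[fields[1].upper()] = fields[2]
    return barcodes


def shared_barcodes(oligos_paths):
    """Map each barcode used in more than one oligos file to its (file, group) uses."""
    uses = defaultdict(list)
    for path in oligos_paths:
        for sequence, group in read_barcodes(path).items():
            uses[sequence].append((path, group))
    return {seq: where for seq, where in uses.items() if len(where) > 1}
```

An empty result from shared_barcodes means no barcode occurs in two files, which is the case where merging before trimming is safe. Any entries in the result point to files that need trim.seqs run separately before merge.files.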
This is great! I had doubts about the right order in which to perform the steps, so now I’m feeling more confident!
I’m indeed never going to choose this company for my sequencing projects EVER again. They were cheap and the quality of the data is very OK. But their service is zero. Of course, I should have decided on how exactly I was going to process the data I was going to end up with BEFORE choosing a sequencing company, so I could ask for more details on the output. A beginner’s mistake that won’t happen to me again!
Thanks again, for your reply, and for Mothur, I love it!