Analysis of multiple datasets with same MID identifiers

Hi everybody!

I’m quite a newbie with mothur, and until now I have only done some fairly common analyses on small batches of samples. Now I would like to analyze a large number of samples, sequenced in 3 different rounds with the 454 sequencer (3 different datasets). Since the sequencing was done separately, I have used the same MID identifiers in the different rounds. The sequencing platform always gives me separate sff files, already split by MID, so now I have one sff file for each sample, with all the sequences found in it.

For the analyses I’ve done so far, I used the command merge.sfffiles to merge the files from a run into a single one, then created an oligos file listing each sample with its corresponding MID sequence, and used this file with trim.seqs so that we know which sequences belong to which sample.

Since I now have samples from multiple rounds of sequencing, the same MID sequence is sometimes shared by 2 or 3 samples, so if I merge the individual sff files with merge.sfffiles, I won’t be able to assign the sequences to the right samples.

Is there a way to work with separate sff files from the beginning in mothur? How? If there is not, how can I approach this problem?

Thank you very much!

sff.multiple is what you’re looking for. It lets you specify pairs of sff/oligos files, runs each pair through barcode trimming, flowgram trimming and denoising, then merges the final output.

You can re-use MIDs between oligos files - this is obviously a really common situation to be in - as long as you make sure the group names are different for each sample, otherwise they’ll get combined along the way.

Thanks for your help, dwaite!

I just have a couple of questions to clarify:

When you say

You can re-use MIDs between oligos files - this is obviously a really common situation to be in - as long as you make sure the group names are different for each sample, otherwise they’ll get combined along the way.

what exactly do you mean? Do I have to group the files according to their run? How do I do that?

I looked at the sff.multiple web page and saw that you need an sfffiles.txt file listing all the sff files you want to analyze together, each with its oligos file name. Should I make a single oligos file for all the samples, where they are all listed with their identifier sequence (even if some share the same one), or should I make separate oligos files, one for each round of sequencing? Right now I tried a single oligos file listing all the samples and their MID sequences, but the command aborts: it is not able to recognize the sff files, apart from the first one…

Say you have 4 samples to sequence, but only 2 MIDs available (silly numbers, but roll with it :lol: ). You can only process two samples in a run - because you’re constrained by your MIDs - but since each run is independent of each other it’s no problem to re-use the MIDs between runs. So if for your first run you get an sff file Run1.sff and set up a Run1.oligos file:

forward CATGCTGCCTCCCGTAGGAGT
barcode AACCAACC Sample1
barcode AACCAAGG Sample2

then for Run2.sff set up:

forward CATGCTGCCTCCCGTAGGAGT
barcode AACCAACC Sample3
barcode AACCAAGG Sample4

You could run each of these independently through trim.flows/shhh.flows/trim.seqs, and the two sets of output could then be merged despite starting with identical barcodes. The sff.multiple command lets you take a shortcut by automating that workflow: you simply create a text file like:

Run1.sff Run1.oligos
Run2.sff Run2.oligos

And using that as the input.
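If you have a lot of runs, the pairing file can be generated rather than typed by hand. Here is a minimal Python sketch, assuming each oligos file shares its base name with its sff file (Run1.sff / Run1.oligos, as above). The function name pair_lines and the output name sfffiles.txt are just illustrative choices, not anything mothur requires.

```python
from pathlib import Path


def pair_lines(sff_names):
    """Return one 'X.sff X.oligos' line per sff file.

    Assumes each oligos file has the same base name as its sff file.
    """
    lines = []
    for name in sorted(sff_names):
        sff = Path(name)
        oligos = sff.with_suffix(".oligos")
        lines.append(f"{sff.name} {oligos.name}")
    return lines


# Example with the two runs from this thread.
for line in pair_lines(["Run2.sff", "Run1.sff"]):
    print(line)
# To save the result, write the lines to a text file, e.g.
# Path("sfffiles.txt").write_text("\n".join(lines) + "\n")
```

The resulting text file is what you would then hand to sff.multiple as its input file.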

My warning about group names is that when you run sff.multiple your data sets are eventually merged, so if you had two oligos files with repeated sample names, for example:

Run1.oligos
forward CATGCTGCCTCCCGTAGGAGT
barcode AACCAACC Sample1
barcode AACCAAGG Sample2

Run2.oligos
forward CATGCTGCCTCCCGTAGGAGT
barcode AACCAACC Sample2
barcode AACCAAGG Sample3

Then the Sample2 from both runs would get combined during sff.multiple and it would be difficult to split them apart.
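It is easy to check for this before running anything. The sketch below is only an illustration, assuming oligos files laid out like the ones above, with the group name as the third field of each barcode line. The helper names are made up for this example. It reports any group name that appears in more than one oligos file.

```python
from collections import defaultdict


def groups_in_oligos(text):
    """Collect group names from 'barcode <sequence> <group>' lines."""
    groups = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].lower() == "barcode":
            groups.append(fields[2])
    return groups


def repeated_groups(oligos_by_run):
    """Map each group name found in more than one oligos file to those files."""
    seen = defaultdict(list)
    for run, text in oligos_by_run.items():
        for group in groups_in_oligos(text):
            seen[group].append(run)
    return {g: runs for g, runs in seen.items() if len(runs) > 1}


# The two problematic oligos files from the example above.
RUN1 = """forward CATGCTGCCTCCCGTAGGAGT
barcode AACCAACC Sample1
barcode AACCAAGG Sample2
"""
RUN2 = """forward CATGCTGCCTCCCGTAGGAGT
barcode AACCAACC Sample2
barcode AACCAAGG Sample3
"""

print(repeated_groups({"Run1.oligos": RUN1, "Run2.oligos": RUN2}))
# {'Sample2': ['Run1.oligos', 'Run2.oligos']}
```

An empty result means every group name is unique across runs, which is what you want before the data sets get merged.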

My advice would be a single oligos file per sequencing run, because personally I find that easier to manage. If you re-used MIDs between runs then I think you’ll have to do this anyway - this is the approach I used in my PhD work, where I ran ~60 samples in total with only 24 MIDs. If instead you split your available MIDs across runs, so that each MID appears in only one run, you can use a single oligos file.