I have sequences from 4 full plates. Is it possible to input all 4 of everything into each command in the workflow? I know that for some commands you can pass several files separated by “-”. I just finished running the plates separately and then concatenating the final output files, but I didn’t consider that e.g. the seqs in the fasta files will not have the same length… For the analysis I need the processed seqs to be in one file.
I guess I could (locally) align the concatenated sequences using e.g. MAFFT?
You’ll basically get to the end of trim.seqs with one fasta, name, and group file after running each of your files through sffinfo, trim.flows, shhh.flows, and trim.seqs. Then you can pick up the SOP at unique.seqs, align.seqs, etc. If you’re doing 16S, I wouldn’t suggest touching MAFFT.
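For each sff file that’s just the regular 454 SOP commands, e.g. something along these lines (the file names are placeholders for your own, and the parameter values are the SOP suggestions):

mothur > sffinfo(sff=plate1.sff, flow=T)
mothur > trim.flows(flow=plate1.flow, oligos=plate1.oligos, pdiffs=2, bdiffs=1)
mothur > shhh.flows(file=plate1.flow.files)
mothur > trim.seqs(fasta=plate1.shhh.fasta, name=plate1.shhh.names, oligos=plate1.oligos, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T)

Do the same for each of the other plates, then combine the trim.seqs outputs before unique.seqs.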
What I ended up doing was concatenating the fasta, name & group files from the 4 plates right where shhh.flows left off… However, I now ended up with fewer sequences (the starting number matched the sum of the 4 separate fasta files):
EDIT: added the #seqs for concatenating after trim.seqs, as suggested by Pat below. Also added the #seqs for running chimera.uchime with dereplicate=t and remove.seqs with dups=f.
         4 X (separately)   concat post shhh.flows   concat post trim.seqs   concat post trim.seqs + new flags
Total:   702979             676445                   699322                  712524
Unique:  65199              60366                    61744                   61864
…maybe this isn’t so strange? I could imagine the # of seqs being reduced even more in the concatenated dataset at the pre.cluster and chimera.uchime steps, compared to running them separately on the 4 subsets.
So if your 4 sff files use the same barcodes for different samples, you may want to concatenate after trim.seqs. One reason you might be getting different total numbers is the chimera checking. By default, if a sequence is flagged as a chimera in one sample it will get yanked from all of the samples, regardless of whether it was flagged in those samples. This could be happening here. The next release will allow you to turn off this feature; we’ve seen it cause problems in cases where a sequence that is abundant in some samples gets flagged in another sample where it is rare (e.g. pre, during, and post antibiotics).
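To combine the trim.seqs output you can either cat the three file types together at the command line or use merge.files inside mothur; roughly (the plate/output names here are placeholders, and the exact output names will vary):

mothur > merge.files(input=plate1.shhh.trim.fasta-plate2.shhh.trim.fasta-plate3.shhh.trim.fasta-plate4.shhh.trim.fasta, output=combined.fasta)
mothur > merge.files(input=plate1.shhh.trim.names-plate2.shhh.trim.names-plate3.shhh.trim.names-plate4.shhh.trim.names, output=combined.names)
mothur > merge.files(input=plate1.shhh.groups-plate2.shhh.groups-plate3.shhh.groups-plate4.shhh.groups, output=combined.groups)

Once the option to turn off the cross-sample removal is out, the chimera steps would look something like this (using “current” to pick up the accnos file that chimera.uchime just wrote):

mothur > chimera.uchime(fasta=combined.fasta, name=combined.names, group=combined.groups, dereplicate=t)
mothur > remove.seqs(accnos=current, fasta=combined.fasta, name=combined.names, group=combined.groups, dups=f)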
I have been trying to combine a number of .sff files (~100) from our sequence provider.
I had to divide them up into 7 groups so that the barcodes didn’t overlap within each group for the sff.multiple command.
At the end I use the merge.files command to make one .fasta, one .names, one .groups and one .summary file. All seems good when I do summary.seqs on this, but then when I do unique.seqs it tells me:
“…
22000 16073
23000 16792
[ERROR]: You already have a sequence named HWWRNVT02EYIIU in your fasta file, sequence names must be unique, please correct.
…”
If I keep going it seems OK (with somewhat fewer sequences), but then when I try to pre.cluster it causes problems again with
“Your groupfile contains more than 1 sequence named HWWRNVT02EYIIU, sequence names must be unique. Please correct.”…
Is the problem that running sff.multiple multiple times ends up giving some sequences the same name? If so, is there any way around this, given that the same barcodes have been used multiple times?
Thanks for your help, and I really enjoyed the course a few weeks back.
Sequence names are unique within and between runs. I suspect that either your splitting wasn’t as perfect as you had hoped or your merging included too many files. Are you using sff.multiple? http://www.mothur.org/wiki/Sff.multiple
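For reference, sff.multiple takes a plain-text file with one sff file and its oligos file per line, and you point the command at that file; something like this (names made up):

plate1.sff plate1.oligos
plate2.sff plate2.oligos

mothur > sff.multiple(file=batch1.txt)

If the same sff file is listed on two lines, or shows up in two of your batch files, you’ll end up with the same read names twice downstream.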
Hi Pat,
thanks for your reply.
Yep, I used sff.multiple.
I had to do it as 5 separate lots so that there weren’t any barcodes used more than once in a single sff.multiple run.
After I got the outputs from each of these, I combined the files (5 of each).
Could you post the sff.multiple and merge.files commands you ran, as well as the input file to sff.multiple? Have you tried to see which sff file HWWRNVT02EYIIU came from? Perhaps you inadvertently added this file twice somewhere?
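A quick way to check from the command line (substitute whatever your per-batch fasta files are called):

grep -l "HWWRNVT02EYIIU" *.fasta
grep -c "HWWRNVT02EYIIU" merged.fasta

The first lists every fasta file the read shows up in; the second counts how many times it appears in your merged file.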
Thanks for your help, I figured it out.
When putting the files together for sff.multiple I had accidentally put a couple of the entries in twice, hence the duplicate names.
All good now.
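For anyone else who hits this: a quick sanity check before running sff.multiple is to look for duplicated lines in the batch file, e.g.

sort batch1.txt | uniq -d

which prints any entry that is listed more than once.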