Merging Files

cfitzgerald2 · June 1, 2013, 4:35am

I have a question about the proper treatment of multiple SFF files.

Brief background: I have been using 454 sequencing to examine different environments and identify potential organisms of interest in those environments. I have also sequenced enrichments from these environments to help confirm (we also do FISH and qPCR) whether the organisms we enrich have our predicted metabolic function(s).

I have been using a GS Junior, so we get between 75,000 and 130,000 sequences back per run, and I have data from 5 runs. I initially processed the files separately, but since the files are not that large I thought it would be better to process them together. I have run into trouble in the past with sequences that are not the same length when clustering and using other distance-based analyses, and I thought this would be one way to combat that. I merged the files after trimming the flowgrams and running shhh.flows. I also thought this would help with computing time because I would be able to reduce the number of sequences I am processing.

I have two questions:

1. Do you think this is the best approach, or would it be better to process the files independently and merge at some point later on?
2. I have no problems when I use my “whole” data set, but if I use the get.groups or remove.groups command to pull out only my samples of interest, I find that some sequences in the file have missing data ("."), whereas my full fasta file has no missing data. Any thoughts on why this may be?

Thank you,

Colin

pschloss · June 3, 2013, 2:24pm

Do you think this is the best approach, or would it be better to process the files independently and merge at some point later on?

Sounds about right - this is what sff.multiple does. We take everything through sffinfo, trim.flows, shhh.flows, and trim.seqs and then we merge the files.

I have no problems when I use my “whole” data set, but if I use the get.groups remove.groups command to pull only my samples of interest out I have found that some sequences in the file have missing data (“.”), but in my full fasta file there is no missing data. Any thoughts on why this may be?

Not sure what you mean about missing data…

cfitzgerald2 · June 3, 2013, 2:42pm

Hey Pat,

Thank you for the response. Next time I will use sff.multiple to save some time. By missing data I mean there are “.” characters at the start of some of my sequences after I use get.groups that are not present in the initial alignment. See below.

Before:

H2P127M01AFT25
----T----T-G----G----G---C--A--CT---C-T-G-G-T-GG-A--AC-C----G-CC------G----G----T-----------------------G-A-------CA-A------------------------G-C-C-G--G-A-G-G-A--AG-G-C--G--GG-GA-TG-AC-G-TC--A-A-G-T--C--CTC-A-T-G-G-C-C-CT--T-A-T-G--G-G-C-T-GG-GC-TAC--AC--AC-G--TG-C--TA----C-AA-TG---G-C-GG-T-G-A--C-AGT-G--G-G-A-------------A-G-C-A-A--G-A-C-C-G-C-G---A-GG-T-G--G-A-G-C--A---A--------A--TCC-C------C---AAAA-G-CC-G-T-C-T-----CAG--TTC--GGA-T-C-GTAC-TC-----T-GC-AA-CT-C---G-A--GT--GC-G-T-G-AAG-TT-GG-AAT-CG-C-TA--G-TAAT-C-G-C-G-GA--TC-A-G-C--A-C--GC--C-G-C-G-GT--G-AAT-AC--GT-T---CCCGG-GCCTT

get.groups(fasta=Bactonly.fasta, name=Bactonly.names, group=Bactonly.groups, taxonomy=Bactonly.taxonomy, groups=REACTORA-REACTORB-REACTORC-B1-B2-B3-B4-B5-B6-B7-B8-NSH-NSL)

H2P127M01AFT25
....T----T-G----G----G---C--A--CT---C-T-G-G-T-GG-A--AC-C----G-CC------G----G----T-----------------------G-A-------CA-A------------------------G-C-C-G--G-A-G-G-A--AG-G-C--G--GG-GA-TG-AC-G-TC--A-A-G-T--C--CTC-A-T-G-G-C-C-CT--T-A-T-G--G-G-C-T-GG-GC-TAC--AC--AC-G--TG-C--TA----C-AA-TG---G-C-GG-T-G-A--C-AGT-G--G-G-A-------------A-G-C-A-A--G-A-C-C-G-C-G---A-GG-T-G--G-A-G-C--A---A--------A--TCC-C------C---AAAA-G-CC-G-T-C-T-----CAG--TTC--GGA-T-C-GTAC-TC-----T-GC-AA-CT-C---G-A--GT--GC-G-T-G-AAG-TT-GG-AAT-CG-C-TA--G-TAAT-C-G-C-G-GA--TC-A-G-C--A-C--GC--C-G-C-G-GT--G-AAT-AC--GT-T---CCCGG-GCCTT

Basically it looks like mothur replaced the gaps (-) with “.” in all of the sequences that started with exclusively gaps, but before I go back and replace all of those “.” with “-” I wanted to make sure I wasn’t overlooking something that may have to do with sequence quality.

Colin

pschloss · June 3, 2013, 3:46pm

The dots don’t really matter and are treated the same way. I wouldn’t worry about it.
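To illustrate the point that “.” and “-” are handled the same way, here is a minimal Python sketch (not part of mothur) that compares the first few columns of the sequence before and after get.groups. The helper names `normalize_gaps` and `ungapped` are made up for this illustration, and the short strings are truncated copies of the alignment posted above.

```python
# Illustrative only: treat '.' and '-' as interchangeable gap characters.
GAP_CHARS = ".-"


def normalize_gaps(aligned_seq: str) -> str:
    """Return the aligned sequence with every '.' rewritten as '-'."""
    return aligned_seq.replace(".", "-")


def ungapped(aligned_seq: str) -> str:
    """Return only the bases, dropping both kinds of gap character."""
    return "".join(base for base in aligned_seq if base not in GAP_CHARS)


# First columns of the example sequence, before and after get.groups.
before = "----T----T-G----G"
after = "....T----T-G----G"

# Same columns once gaps are normalized, and the same underlying bases.
same_alignment = normalize_gaps(before) == normalize_gaps(after)
same_bases = ungapped(before) == ungapped(after)
print(same_alignment, same_bases)
```

If both comparisons hold for every sequence in the file, the only difference between the two fasta files is which gap character is used.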

cfitzgerald2 · June 3, 2013, 4:36pm

Thank you.

oalzahal · August 4, 2013, 10:01pm

Pat, what files should be merged?
file1.shhh.trim.fasta, file1.shhh.trim.names?? i.e.

mothur > merge.files(input=file1.shhh.trim.fasta-file2.shhh.trim.fasta, output=file.shhh.trim.fasta)
and
mothur > merge.files(input=file1.shhh.trim.names-file2.shhh.trim.names, output=file.shhh.trim.names)
I believe we won’t need the oligos file after that?
many thanks

pschloss · August 6, 2013, 11:12am

Right, and you also need the groups file.
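For more than two runs, typing the merge.files lines by hand gets tedious. The following Python sketch only builds the command text, using the same `input=...-..., output=...` syntax shown in the post above; it does not run mothur. The function name `merge_files_command`, the `file1`/`file2` prefixes, and especially the groups file names are placeholders, so check the names your own trim.seqs run actually produced.

```python
def merge_files_command(inputs, output):
    """Build a mothur merge.files command string for the given input files."""
    if len(inputs) < 2:
        raise ValueError("need at least two input files to merge")
    # mothur separates the input file names with '-'.
    return f"merge.files(input={'-'.join(inputs)}, output={output})"


runs = ["file1", "file2"]  # placeholder run prefixes, as in the post above

# One command each for the fasta and names files.
commands = [
    merge_files_command(
        [f"{run}.shhh.trim.{kind}" for run in runs], f"file.shhh.trim.{kind}"
    )
    for kind in ("fasta", "names")
]

# Assumption: the groups files are named <run>.shhh.groups; adjust as needed.
commands.append(
    merge_files_command([f"{run}.shhh.groups" for run in runs], "file.shhh.groups")
)

for command in commands:
    print(command)
```

Because the file names are joined with “-”, this sketch assumes none of the file names themselves contain a hyphen.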

oalzahal · August 6, 2013, 10:28pm

Many-thanks Pat.

chal4oye · January 17, 2014, 10:35am

in all of the sequences that started with exclusively gaps with “.” , but before I go back and replace all of those “.” with “-” I wanted to make sure I wasn’t overlooking something that may have to do with sequence quality?

pschloss · January 20, 2014, 2:12pm

“.” characters are merely gaps at the beginning and ends of sequences. I wouldn’t mess with them.
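If you want to confirm that the “.” characters in your own file really are confined to the beginning and end of each aligned sequence, a quick check along these lines may help. This is a hedged Python sketch, not a mothur command; `dots_are_terminal` and `terminal_gap_lengths` are hypothetical helper names, and the test strings are invented.

```python
def dots_are_terminal(aligned_seq: str) -> bool:
    """True if every '.' lies in a leading or trailing run of gap characters."""
    # Remove the leading and trailing runs of '.' and '-', then look for
    # any '.' left over between the first and last base.
    core = aligned_seq.strip(".-")
    return "." not in core


def terminal_gap_lengths(aligned_seq: str):
    """Return (leading, trailing) counts of gap characters around the bases."""
    leading = len(aligned_seq) - len(aligned_seq.lstrip(".-"))
    trailing = len(aligned_seq) - len(aligned_seq.rstrip(".-"))
    return leading, trailing


print(dots_are_terminal("....T--G.."))   # dots only at the ends
print(dots_are_terminal("--T.G--"))      # a dot between two bases
print(terminal_gap_lengths("....T--G.."))
```

A sequence for which `dots_are_terminal` returns False would be worth a closer look, since that pattern is not what the reply above describes.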
