Merging Files

I have a question about the proper treatment of multiple SFF files.

Brief background: I have been using 454 sequencing to examine different environments and identify potential organisms of interest in those environments. I have also sequenced enrichments from these environments to help confirm (we also do FISH and qPCR) whether the organisms we enrich have our predicted metabolic function(s).

I have been using a GsJR, so we get between 75,000 and 130,000 sequences back per run, and I have data from 5 runs. I initially processed the files separately, but since the files are not that large I thought it would be a better idea to process them together. I have run into trouble in the past with sequences that are not the same length when clustering and using other distance-based analyses, and I thought this would be one way to combat that. I merged the files after I had trimmed the flowgrams and run shhh.flows. I also thought this would help with computing time because I would be able to reduce the number of sequences I am processing.

I have two questions:

  1. Do you think this is the best approach, or would it be better to process the files independently and merge at some point later on?
  2. I have no problems when I use my “whole” data set, but if I use the get.groups or remove.groups command to pull out only my samples of interest, I find that some sequences in the file have missing data ("."), whereas my full fasta file has no missing data. Any thoughts on why this may be?

Thank you,

Colin

  1. Do you think this is the best approach, or would it be better to process the files independently and merge at some point later on?

Sounds about right - this is what sff.multiple does. We take everything through sffinfo, trim.flows, shhh.flows, and trim.seqs and then we merge the files.

  2. I have no problems when I use my “whole” data set, but if I use the get.groups or remove.groups command to pull out only my samples of interest, I find that some sequences in the file have missing data (“.”), whereas my full fasta file has no missing data. Any thoughts on why this may be?

Not sure what you mean about missing data…

Hey Pat,

Thank you for the response. Next time I will use sff.multiple to save some time. By missing data I mean there are “.” characters at the start of some of my sequences after I use get.groups that are not present in the initial alignment. See below.

Before:

H2P127M01AFT25
----T----T-G----G----G—C–A–CT—C-T-G-G-T-GG-A–AC-C----G-CC------G----G----T-----------------------G-A-------CA-A------------------------G-C-C-G–G-A-G-G-A–AG-G-C–G–GG-GA-TG-AC-G-TC–A-A-G-T–C–CTC-A-T-G-G-C-C-CT–T-A-T-G–G-G-C-T-GG-GC-TAC–AC–AC-G–TG-C–TA----C-AA-TG—G-C-GG-T-G-A–C-AGT-G–G-G-A-------------A-G-C-A-A–G-A-C-C-G-C-G—A-GG-T-G–G-A-G-C–A—A--------A–TCC-C------C—AAAA-G-CC-G-T-C-T-----CAG–TTC–GGA-T-C-GTAC-TC-----T-GC-AA-CT-C—G-A–GT–GC-G-T-G-AAG-TT-GG-AAT-CG-C-TA–G-TAAT-C-G-C-G-GA–TC-A-G-C–A-C–GC–C-G-C-G-GT–G-AAT-AC–GT-T—CCCGG-GCCTT

get.groups(fasta=Bactonly.fasta, name=Bactonly.names, group=Bactonly.groups, taxonomy=Bactonly.taxonomy, groups=REACTORA-REACTORB-REACTORC-B1-B2-B3-B4-B5-B6-B7-B8-NSH-NSL)

H2P127M01AFT25
…T----T-G----G----G—C–A–CT—C-T-G-G-T-GG-A–AC-C----G-CC------G----G----T-----------------------G-A-------CA-A------------------------G-C-C-G–G-A-G-G-A–AG-G-C–G–GG-GA-TG-AC-G-TC–A-A-G-T–C–CTC-A-T-G-G-C-C-CT–T-A-T-G–G-G-C-T-GG-GC-TAC–AC–AC-G–TG-C–TA----C-AA-TG—G-C-GG-T-G-A–C-AGT-G–G-G-A-------------A-G-C-A-A–G-A-C-C-G-C-G—A-GG-T-G–G-A-G-C–A—A--------A–TCC-C------C—AAAA-G-CC-G-T-C-T-----CAG–TTC–GGA-T-C-GTAC-TC-----T-GC-AA-CT-C—G-A–GT–GC-G-T-G-AAG-TT-GG-AAT-CG-C-TA–G-TAAT-C-G-C-G-GA–TC-A-G-C–A-C–GC–C-G-C-G-GT–G-AAT-AC–GT-T—CCCGG-GCCTT

Basically, it looks like mothur replaced the gaps (-) at the start of every sequence that began exclusively with gaps with “.”, but before I go back and replace all of those “.” with “-” I wanted to make sure I wasn’t overlooking something that may have to do with sequence quality.

Colin

The dots don’t really matter and are treated the same way. I wouldn’t worry about it.

Thank you.

Pat, what files should be merged?
file1.shhh.trim.fasta, file1.shhh.trim.names?? i.e.

mothur > merge.files(input=file1.shhh.trim.fasta-file2.shhh.trim.fasta, output=file.shhh.trim.fasta)
and
mothur > merge.files(input=file1.shhh.trim.names-file2.shhh.trim.names, output=file.shhh.trim.names)
I believe we won’t need the oligos file after that?
Many thanks

Right, and you also need the groups file.
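
Since the fasta, names, and groups files all have to be merged and kept in step, a quick consistency check after merging can catch a file that was left out. Below is a minimal Python sketch, not part of mothur; it assumes the usual layouts (a names file with a representative name followed by a comma-separated list of names, and a groups file with a sequence name followed by a group label). All function names are made up for illustration.

```python
def fasta_ids(text):
    """Return the sequence names found on '>' header lines."""
    return {line[1:].split()[0] for line in text.splitlines() if line.startswith(">")}


def names_map(text):
    """Parse a names file: representative, then comma-separated member names."""
    reps = {}
    for line in text.splitlines():
        if line.strip():
            rep, members = line.split()[:2]
            reps[rep] = members.split(",")
    return reps


def group_ids(text):
    """Parse a groups file: sequence name, then group label."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def check_merged(fasta_text, names_text, groups_text):
    """Return a list of problems; an empty list means the three files agree."""
    problems = []
    reps = names_map(names_text)
    groups = group_ids(groups_text)
    # Every sequence in the fasta file should be a representative in the names file.
    for seq_id in sorted(fasta_ids(fasta_text) - set(reps)):
        problems.append(f"{seq_id}: in fasta but not in names")
    # Every name listed in the names file should have a group assignment.
    for rep, members in reps.items():
        for member in members:
            if member not in groups:
                problems.append(f"{member}: in names but not in groups")
    return problems
```

To use it on real files, read each merged file into a string (for example with `open(path).read()`) and pass the three strings to `check_merged`.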

Many thanks, Pat.

in all of the sequences that started with exclusively gaps with “.”, but before I go back and replace all of those “.” with “-” I wanted to make sure I wasn’t overlooking something that may have to do with sequence quality?

“.” characters are merely gaps at the beginning and ends of sequences. I wouldn’t mess with them.
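
If you want to confirm for yourself that a sequence before and after get.groups differs only in these terminal characters, a comparison along the following lines may help. This is a rough Python sketch, not a mothur feature, and the function names are hypothetical. It only compares strings in memory and does not edit any files, so the “.” characters in your alignment are left alone.

```python
def normalize_terminal_gaps(aligned):
    """Return a copy where leading and trailing '.' or '-' are all written as '-'.

    Gaps inside the sequence are left exactly as they are.
    """
    left = len(aligned) - len(aligned.lstrip(".-"))
    if left == len(aligned):
        # The whole string is gap characters (or it is empty).
        return "-" * len(aligned)
    right = len(aligned) - len(aligned.rstrip(".-"))
    core = aligned[left:len(aligned) - right]
    return "-" * left + core + "-" * right


def same_alignment(seq_a, seq_b):
    """True if two aligned sequences differ only in how terminal gaps are written."""
    return normalize_terminal_gaps(seq_a) == normalize_terminal_gaps(seq_b)
```

For example, `same_alignment("....T-G", "----T-G")` evaluates to `True`, while two sequences that differ at an internal position still compare as different.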