Incomplete trimming using sffinfo?

Hi,
When I run sffinfo, I always get some extra, unwanted sequence at the start of each read, which makes it hard to sort the reads by barcode using trim.flows. I can't simply add this extra sequence to my barcodes, because it varies a bit between reads and runs, so too many good sequences end up as scrap.

I’ve compared FASTA extraction using sffinfo and Biopython's SeqIO.convert, and here is an example of what I get:

Mothur sffinfo (trim=F):

HCZNHVY02GP28Z xy=2640_1633
gactACTCTCGTGTTACGGCCAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGCTCCGGGCCGGCCTAGTCGACTGCCAAGGCACACaggggataggn

Mothur sffinfo (trim=T):

HCZNHVY02GP28Z xy=2640_1633
ACTCTCGTGTTACGGCCAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGCTCCGGGCCGGCCTAGTCGACTGCCAAGGCACAC

Biopython raw:

HCZNHVY02GP28Z
gactactctcgtgttacggccAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGC
TCCGGGCCGGCCTAGTCGACTGCCAAGGCACACAggggataggn

Biopython trimmed:

HCZNHVY02GP28Z
AGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGCTCCGGGCCGGCCTAGTCGACTGCCAAGGCACACA

The Biopython trimming is always perfect (in the example above my barcode is AGAGTC). The FASTA files provided by my sequencing facility are also fine (they told me they use a Roche tool for the conversion).
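For reference, the comparison above can be reproduced with a few lines of plain Python. This is only a sketch: it assumes that lowercase letters in the untrimmed output mark the bases flagged for clipping (which is consistent with the trimmed outputs shown above), and the helper names `kept_region` and `barcode_offset` are made up for illustration, not part of mothur or Biopython.

```python
# Untrimmed reads copied from the outputs above (same read, two tools).
BARCODE = "AGAGTC"
MOTHUR_RAW = (
    "gactACTCTCGTGTTACGGCCAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGC"
    "TCCGGGCCGGCCTAGTCGACTGCCAAGGCACACaggggataggn"
)
BIOPYTHON_RAW = (
    "gactactctcgtgttacggccAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGC"
    "TCCGGGCCGGCCTAGTCGACTGCCAAGGCACACAggggataggn"
)


def kept_region(raw_read: str) -> str:
    """Return the span from the first to the last uppercase base.

    Assumption: lowercase bases are the ones marked for clipping.
    """
    upper_positions = [i for i, base in enumerate(raw_read) if base.isupper()]
    if not upper_positions:
        return ""
    return raw_read[upper_positions[0]:upper_positions[-1] + 1]


def barcode_offset(trimmed_read: str, barcode: str) -> int:
    """Return the index where the barcode first occurs, or -1 if absent."""
    return trimmed_read.find(barcode)


mothur_trimmed = kept_region(MOTHUR_RAW)
biopython_trimmed = kept_region(BIOPYTHON_RAW)

# Number of bases that precede the barcode in each trimmed read.
mothur_offset = barcode_offset(mothur_trimmed, BARCODE)
biopython_offset = barcode_offset(biopython_trimmed, BARCODE)

print("mothur offset:", mothur_offset)
print("mothur extra prefix:", mothur_trimmed[:mothur_offset])
print("Biopython offset:", biopython_offset)
```

An offset of zero means the trimmed read starts directly with the barcode; anything larger is the extra leading sequence that makes barcode matching fail.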

I get the same result using v1.23 on my Mac and v1.22 on Windows.

Any ideas?

Thanks!
Marc

Hi Marc,

I’m not sure I follow what the problem is :( With trim=T, sffinfo removes the bases that the sff file says should be trimmed. With trim=F, all of the bases (good and bad) are left in. I think the problem is that your sequencing center is splitting the files by another barcode and then giving you the data. Sequences typically start with four test bases and then proceed. A couple of questions…

  1. Could you send us the first 200 lines or so of the output from sffinfo(sfftxt=T)?
  2. Does your sequencing center run the flows in a non-standard order? What order are they flowed in? (This will be in the output of #1.)
  3. To run the SOP, you need to use the flow data, which will have all of the data from “gact” on through to the “ggn”. This means that you need to get the actual barcodes used by the center to split up the files.
  4. Who is doing your sequencing? Can you provide us with their contact information? We’ve been seeing a couple of these problems lately.
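To illustrate why the flow order in question 2 matters, here is a deliberately simplified Python sketch of how a flowgram maps to bases. It is not how mothur or the Roche software actually calls bases: the function name `flows_to_bases`, the signal values, and the simple rounding rule are all assumptions made up for the example, and "TACG" is assumed here to be the standard flow order.

```python
def flows_to_bases(flow_values, flow_order):
    """Convert flow signals to bases by rounding each signal.

    flow_values: one signal per flow (hypothetical values; about 1.0
                 per incorporated base).
    flow_order:  the repeating nucleotide order of the flows.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up to get the homopolymer length for this flow.
        count = int(signal + 0.5)
        bases.append(nucleotide * count)
    return "".join(bases)


# Made-up signals for the first twelve flows of a read.
example_flows = [
    0.02, 0.01, 0.00, 1.03,
    0.04, 0.98, 1.01, 0.00,
    0.97, 0.03, 0.02, 0.05,
]

# The same signals read with two different flow orders.
standard_call = flows_to_bases(example_flows, "TACG")
other_call = flows_to_bases(example_flows, "GACT")

print("TACG order:", standard_call)
print("GACT order:", other_call)
```

The same flow signals produce different leading bases under different flow orders, which is why the order reported in the sfftxt output needs to match what trim.flows expects.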

Pat