Hi,
When I run sffinfo, I always get some extra unwanted sequence at the start of each read, which makes it hard to sort the reads by barcode using trim.flows. I can't simply add this extra sequence to my barcodes, because it varies a bit between reads and runs, so too many good sequences end up as scrap.
I've compared extracting fasta files with sffinfo and with Biopython's SeqIO.convert, and here is an example of what I'm getting:
Mothur sffinfo (trim=F):
HCZNHVY02GP28Z xy=2640_1633
gactACTCTCGTGTTACGGCCAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGCTCCGGGCCGGCCTAGTCGACTGCCAAGGCACACaggggataggn
Mothur sffinfo (trim=T):
HCZNHVY02GP28Z xy=2640_1633
ACTCTCGTGTTACGGCCAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGCTCCGGGCCGGCCTAGTCGACTGCCAAGGCACAC
Biopython raw:
HCZNHVY02GP28Z
gactactctcgtgttacggccAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGC
TCCGGGCCGGCCTAGTCGACTGCCAAGGCACACAggggataggn
Biopython trimmed:
HCZNHVY02GP28Z
AGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGCTCCGGGCCGGCCTAGTCGACTGCCAAGGCACACA
The Biopython trimming is always perfect (in the example above my barcode is AGAGTC). The fasta files provided by my sequencing facility are also fine (they told me they use a Roche tool for the conversion).
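To make the difference concrete, here is a minimal Python sketch that compares where the two untrimmed outputs above put the clip boundaries. It assumes, as the outputs suggest, that lowercase letters mark bases flagged for clipping and uppercase letters mark the bases that are kept. The helper name `split_clips` and the variable names are just for illustration; they are not part of mothur or Biopython.

```python
# Untrimmed reads copied from the example above (HCZNHVY02GP28Z).
MOTHUR_RAW = (
    "gactACTCTCGTGTTACGGCCAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGC"
    "TCCGGGCCGGCCTAGTCGACTGCCAAGGCACACaggggataggn"
)
BIOPYTHON_RAW = (
    "gactactctcgtgttacggccAGAGTCGGAGACTGGGGACTTCCTGGTAAAGAACGTTGC"
    "TCCGGGCCGGCCTAGTCGACTGCCAAGGCACACAggggataggn"
)
BARCODE = "AGAGTC"


def split_clips(seq):
    """Split a read into (left_clip, kept, right_clip) based on letter case.

    Assumption: lowercase flanks are the clipped bases, the uppercase
    middle is what a trimmed extraction would keep.
    """
    start = 0
    while start < len(seq) and seq[start].islower():
        start += 1
    end = len(seq)
    while end > start and seq[end - 1].islower():
        end -= 1
    return seq[:start], seq[start:end], seq[end:]


mothur_left, mothur_kept, mothur_right = split_clips(MOTHUR_RAW)
bio_left, bio_kept, bio_right = split_clips(BIOPYTHON_RAW)

# Length of the clipped prefix in each tool's output.
print("mothur left clip:", len(mothur_left), mothur_left)
print("Biopython left clip:", len(bio_left), bio_left)

# Does the kept region start with the barcode?
print("mothur starts with barcode:", mothur_kept.startswith(BARCODE))
print("Biopython starts with barcode:", bio_kept.startswith(BARCODE))

# Where the barcode sits inside the mothur-trimmed read.
print("barcode offset in mothur read:", mothur_kept.find(BARCODE))
```

For this read, mothur clips only the first 4 bases, while Biopython clips 21, so the mothur-trimmed read still carries 17 extra bases in front of the barcode.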
I get the same result using v1.23 on my Mac and v1.22 on Windows.
Any ideas?
Thanks!
Marc
With trim=T, sffinfo removes the bases that the sff file says should be trimmed. With trim=F, all of the bases (good and bad) are left in. I think the problem is that your sequencing center is splitting the files by another barcode and then giving you the data. Sequences typically start with four test bases, followed by the read itself. A couple of questions…