sffinfo - skipped the first base?

Hi Pat et al – I am very excited about the sffinfo() function you added to the latest release of mothur, as we don’t have the Roche sffinfo software on our lab computers – thanks! I’m running into a problem, however, when extracting sequences. The fasta file we got from our sequencing center has an additional base on the beginning of every sequence that is trimmed by mothur. Looking at the sff.txt file created, the bases entry for an example sequence is:

Bases: tcagaGATACACGCGCCATGCTGCCTCCCGTAGGAGTCCTGAGCCAGGATCAAactctggactgagcgggctggcaaggcnn

The fasta entry returned from our seq center is:

>FVUWOJD01D2GKY rank=0000053 x=1551.0 y=480.0 length=54
AGATACACGCGCCATGCTGCCTCCCGTAGGAGTCCTGAGCCAGGATCAAACTCT

But, mothur/sffinfo() cuts off that first A, giving:

>FVUWOJD01D2GKY
GATACACGCGCCATGCTGCCTCCCGTAGGAGTCCTGAGCCAGGATCAAactct

This happens systematically, trimming one extra base from the beginning of all sequences. That chopped-off base is part of our barcode, so it has a problematic effect.

The TCAG key is at the beginning of every sequence, and is supposed to get trimmed. Consequently, the “Clip Qual Left” value is always 5. I think perhaps mothur is interpreting that Clip Qual Left value exclusively (i.e., removes bases 1-5) when it should be inclusive (i.e., it should remove bases 1-4 and leave 5). Either that, or Clip Qual Left is actually supposed to be 4, and for some reason is wrong? Please let me know if there is anything else I can provide, and thank you for adding this function!

Cheers,
Jeff Werner

One other slight difference between mothur-created sff.txt files and the normal format: the key term “Flowgram:” is used in the traditional sffinfo output, but it is given a capital “G” from mothur, “FlowGram:” It was of-course easy to search-and-replace, but I thought I would mention it. The capital G messed up the Denoiser software we are using. Also, the miscounted bases from the Clip values results in some lower-case bases that should be upper-case, shown in the above post. Thanks!

I can confirm that this bug exists in Mothur 12 for Cygwin. I can also further propose that the solution is to edit the file sffinfocommand.cpp such that line 589 reads:

seq = seq.substr(header.clipQualLeft-1, (header.clipQualRight-header.clipQualLeft+1));

and similarly throughout the file.

This is the kind of bugfix that us users could easily submit if there were a working CVS or SVN repository.
Nudge nudge. Wink wink.

:mrgreen:

Robin

Oh and another thought. When a trim value is given as “0” it means to trim the sequence from the very beginning or the very end. i.e. “accession 25 0” means keep everything from the 25th base to the end. I don’t see that mothur accounts for this.

Robin

Sorry about this guys and thanks for bringing them to our attention. We’re going to release a 1.12.1 with these changes and a couple others in the next day or two. Please be patient as we get the kinks worked out, we agree that this will be a great feature once we get it right…

I am pleased to report that, at least sequence-wise, mothur’s sffinfo command in 1.12.1 matches Roche’s output. However, glancing over the code, I’m still not sure mothur handles the situation where the trim value is “0”. This can potentially lead to negative array indices being passed as the code is currently written. Granted, in no case was this an issue in my sff files because, evidently, the terminal bases are almost never usable anyway. But it’s worth keeping in mind.

There are still differences in formatting between mothur’s output and Roche’s. Roche provides much more information on the definition line and wraps the sequences at 60 characters. I doubt this will break anything downstream, but I think it would be good to emulate Roche’s output as closely as possible.

mothur:

F3MK0ZM01DCTYE
GACACTCTCGAGACATCACCGGCTCTAGTTCTTTAGTCATCGAGTCAACATGTCTTGTGGCGCCTTTCAAAGTGGTAGAA
ATACAGACGATGAATCTACCGCTTCTCTTATAGTCTGCTGCATTTGCG

Roche:

F3MK0ZM01DCTYE length=128 xy=1259_0788 region=1 run=R_2009_10_05_14_25_37_
GACACTCTCGAGACATCACCGGCTCTAGTTCTTTAGTCATCGAGTCAACATGTCTTGTGG
CGCCTTTCAAAGTGGTAGAAATACAGACGATGAATCTACCGCTTCTCTTATAGTCTGCTG
CATTTGCG

Robin

Excellent news! Thanks for such a rapid response!