Data files for SRA submission

Hello,

I’d like to upload our 454 16S rRNA data to the NCBI SRA for public access. From what I’ve read, it needs to be uploaded as either SFF files or FASTQ files separated by sample.
The problem is that we have multiple SFF files which were merged in the subsequent analyses. Also, some samples were run more than once and are therefore in more than one of the SFF files.

Is there a way to separate the original SFF files by barcoded sample, keeping them in SFF format, and then merge those SFF files for the same sample? Alternatively, would it be better to use the output of the shhh.flows command (the shhh.fasta and shhh.qual files for each sample) to generate FASTQ files, if that’s possible? Or is there another way of doing this?

Thanks a lot in advance for any help.

Best wishes
James

Try running sffinfo with an oligos file and that will give you separate sff files for each barcode. The qual file from shhh.flows is pretty useless for generating a fastq file.
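For example, the command looks like this (the file names here are placeholders for your own):

mothur > sffinfo(sff=reads.sff, oligos=reads.oligos)

This writes one trimmed sff file per barcode, named after the groups in your oligos file.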

You can also use your groups file to get sequence accessions for each group, and then use the Roche sfffile program to make a new sff with just these accessions, one sff per group.
The script below will do just that. Note, however, that it only accepts one input sff file, so you will have to merge your original files using sfffile first, or modify the script. Actually, it may just work if you give a directory of sff files as the second option; I haven’t tried.

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

my ( $file, $sfffile );

my $result = GetOptions(
    "file=s"    => \$file,
    "sfffile=s" => \$sfffile,
);

# Require both the groups file and the sff file (or sff directory).
unless ( $file && -e $file && $sfffile && -e $sfffile ) {
    print "Usage: split_sff.pl -f final.groups -s reads.sff\n";
    exit;
}

open my $IN, '<', $file
    or die "Could not open file $file: $!\n";

# Map each group name to the list of sequence accessions assigned to it.
my %data;

while (<$IN>) {
    chomp;
    my ( $seqid, $group ) = split /\t/;

    push @{ $data{$group} }, $seqid;
}

close $IN;

# For each group, write its accessions to a file, then let sfffile
# build a new sff containing just those reads.
for my $group ( sort keys %data ) {
    my $outfile = $group . "_ids.txt";

    open my $OUT, '>', $outfile
        or die "Could not open file $outfile: $!\n";

    print $OUT "$_\n" for @{ $data{$group} };

    close $OUT;

    my $sff_outfile = "$group.sff";

    system( 'sfffile', '-i', $outfile, '-o', $sff_outfile, $sfffile ) == 0
        or die "sfffile failed for group $group\n";
}

Thanks for the suggestions. I’ll have a go.

Thanks, Pat! It works.

sffinfo was useful for separating the sff files by sample, thanks. I would now like to merge two or more of the sff files. On the merge.files entry it’s noted that this command cannot be used for sff files, but it does say:

Actually, to merge SFF files you can simply use this command:
sfffile -o combined.sff infile1.sff infile2.sff

I take it this is not a command that can be run in mothur itself? Is the tool for this available somewhere?
Thanks

Hi everyone,

If the “qual file from shhh.flows is pretty useless for generating a fastq file” and the Roche sff tools are not available, what’s the way out for a correct submission in JamesoK’s case (and mine too)? Thanks for helping.

Fred

Hi,

I am currently trying to submit to the European SRA (EBI SRA): https://www.ebi.ac.uk/ena/submit/sra/#home
As suggested above, I have used sffinfo with the oligos option to demultiplex my sff file for submission. I have one sff file from the 454 GS FLX pyrosequencer.
sffinfo gives me nicely demultiplexed sff files, but the sffview check from Roche makes the EBI SRA submission process stop. I get a “file integrity check failed” error.

The following is the sffview output for one of the files:

sffview H2L0GVH01.Tine13.pA.sff | head
Error:  Unable to read SFF file:  H2L0GVH01.Tine13.pA.sff
Common Header:
  Magic Number:  0x2E736666
  Version:       0001
  Index Offset:  1439796800
  Index Length:  9110590
  # of Reads:    17263
  Header Length: 840
  Key Length:    4
  # of Flows:    800
  Flowgram Code: 1
--------

I really don’t know what went wrong. I tried again with trim=F; however, the result is the same, only the line at which the error is displayed comes a bit further down.

sffview H2L0GVH01.Tine1.pA.sff | head
Common Header:
  Magic Number:  0x2E736666
  Version:       0001
Error:  Unable to read SFF file:  for_check_by_marc/H2L0GVH01.Tine1.pA.sff
  Index Offset:  1439796800
  Index Length:  9110590
  # of Reads:    4741
  Header Length: 840
  Key Length:    4
  # of Flows:    800
  Flowgram Code: 1

I am lost as to what goes wrong here. I did the analysis on this data in mothur with no issues whatsoever. Is this typical behaviour of sffinfo? Has anybody had this problem before? How can I possibly fix it? Is there a way to make bam or fastq files (also demultiplexed!) in mothur?

Kind regards,

FM

Hi,

Although it doesn’t fix the issue with the sff files, I have for now taken the following approach to produce *.fastq files from my demultiplexed *.sff files.
First I used the sffinfo command with an oligos file supplied. Then I ran sffinfo again on each separate demultiplexed sff file with no further options. Finally, I used the fasta and qual files output by that second sffinfo run to generate a fastq file with make.fastq. Let’s hope this does the trick.
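Sketched as mothur commands, with the file names from the posts above standing in for your own:

mothur > sffinfo(sff=H2L0GVH01.sff, oligos=H2L0GVH01.oligos)
mothur > sffinfo(sff=H2L0GVH01.Tine13.pA.sff)
mothur > make.fastq(fasta=H2L0GVH01.Tine13.pA.fasta, qfile=H2L0GVH01.Tine13.pA.qual)

(The second and third commands are repeated for each demultiplexed sff file.)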

Kind regards,

FM

We have a bug in the Windows version of the sffinfo command when using the oligos option. It will be fixed in version 1.33.0, releasing very soon. We are also adding a group parameter to the sffinfo command. Version 1.33.0 will also include an oligos parameter and parsing options in the fastq.info command to make preparing your files for submission easier.
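Once 1.33.0 is out, the preparation described in this thread should reduce to something along these lines (hypothetical file names, using the oligos parameter as announced above):

mothur > fastq.info(fastq=reads.fastq, oligos=reads.oligos)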

Hi, I was wondering why you ran sffinfo again. I tried sffinfo → trim.flows → shhh.flows and then tried make.fastq using the fasta and qual files given, but I get: [ERROR]: mismatch between fasta and quality files. Found H2ZDNKJ02I00OR in fasta file and H2ZDNKJ02GTU91 in quality file…

The .qual and .fasta files contain the same number of reads, but not all of the read names are the same…

Problem solved (although I don’t fully understand why this is). Sorry, I should have read the earlier post properly before posting!!