I am completely new to the “sequence analysis world” and I have 454 sequences generated with the FLX+ technique. That means very long sequences (up to 950 bp), and they begin randomly with either the forward or the reverse primer. I figured out how to trim the sequences, and that I have to create the group file on my own because the barcodes were already removed. I aligned my sequences to the silva.bacteria reference, but now I am completely stuck: I can't follow the tutorials because my sequences look completely different. The summary after aligning looks like this:
             Start    End      NBases   Ambigs    Polymer  NumSeqs
Minimum:     0        0        0        0         1        1
2.5%-tile:   1044     1051     2        0         1        2633
25%-tile:    1044     25294    640      0         5        26326
Median:      1044     27654    731      0         5        52652
75%-tile:    3161     27654    822      0         6        78977
97.5%-tile:  7930     28467    874      2         7        102670
Maximum:     43116    43117    945      7         8        105302
Mean:        2824.43  25172.2  694.752  0.189332  5.17678
# of unique seqs: 99596
total # of seqs: 105302
The high number of unique sequences seems to be caused by the high variance in sequence length. What should I do now? When I try to bring the sequences to a similar start and end position, I lose about 65% of my sequences. Is this variance also a result of the sequencing starting from either the reverse or the forward primer?
I have no idea what to do and I would be so happy about any help!
I wish I had known beforehand that this form of sequencing causes so much trouble…
Thank you so much!
So a couple of pieces of unsolicited advice…
- You really want to get the sff file from your sequence provider.
- Dual end sequencing is not helpful. The convention with 16S is to pick one end to sequence from and do that. Otherwise half your sequences will start from one end and the other half from the other end and because they rarely overlap, you have to analyze them separately. This causes more complications than anyone wants to rationally deal with.
- Don’t use FLX+ (yet). We’ve been working with Roche and they still don’t have a good set of parameters for image analysis to convert the raw data into an sff file. This seems to be a case of putting the cart before the horse on the part of several sequence providers.
Regardless… I would strongly encourage you to get the sff file from your sequence provider and then run things through the trim.seqs approach with qwindowaverage=35, qwindowsize=50 (look at the SOP). The release due out next week will allow you to run trim.flows/shhh.flows. What you are seeing is common in terms of varying sequence lengths. What you’ll do is run screen.seqs and then filter.seqs to get your sequences to overlap the same region and only that region. Then the number of uniques will drop. However… because of the problems we’ve been seeing with FLX+, the error rates are much higher than what we saw previously.
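For readers who want to see what a windowed quality trim with qwindowaverage=35 and qwindowsize=50 does in principle, here is a minimal Python sketch. It is not mothur's implementation: the function name `window_trim`, the step size of one base, the choice to cut at the start of the first failing window, and the handling of reads shorter than one window are all assumptions made for illustration.

```python
def window_trim(quals, window_size=50, min_average=35, step=1):
    """Return how many leading bases to keep from a read.

    quals is a list of per-base quality scores. The read is cut at the
    start of the first window whose mean quality falls below min_average
    (an assumed convention, not necessarily what mothur does).
    """
    n = len(quals)
    if n < window_size:
        # Assumption: a read shorter than one window is judged as a whole.
        return n if n and sum(quals) / n >= min_average else 0
    for start in range(0, n - window_size + 1, step):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_average:
            return start  # keep bases [0, start)
    return n  # no window failed; keep the whole read


# Toy read: 100 good bases followed by 100 poor ones (made-up scores).
toy_quals = [40] * 100 + [10] * 100
keep = window_trim(toy_quals)
print(f"keep the first {keep} of {len(toy_quals)} bases")
```

Because the window average drops as soon as a few low-quality bases enter it, the cut lands some way before the point where quality actually collapses, which is part of why reads come out at varying lengths.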
To get the reads for the first half, I’d suggest going with start=1044, minlength=something and to get the reads for the second half I’d suggest going with end=27654, minlength=something.
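The logic of the two screens above can be sketched in Python as follows. This is only an illustration of the selection rule (keep reads that start at or before a given alignment position, or end at or after one, and that are long enough), not a replacement for screen.seqs. The names `AlignedRead` and `split_by_direction` are hypothetical, the toy reads are made up, and the `min_length=500` in the usage lines is an arbitrary placeholder, since the right minlength has to be chosen from your own data.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    name: str
    start: int   # first alignment column covered by the read
    end: int     # last alignment column covered by the read
    nbases: int  # number of bases, gaps excluded


def split_by_direction(reads, *, min_length, fwd_start=1044, rev_end=27654):
    """Split aligned reads into two sets, one per sequencing direction.

    forward: reads that start at or before fwd_start
    reverse: reads that end at or after rev_end
    Both sets also require nbases >= min_length. A read long enough to
    satisfy both conditions would appear in both lists.
    """
    forward = [r.name for r in reads
               if r.start <= fwd_start and r.nbases >= min_length]
    reverse = [r.name for r in reads
               if r.end >= rev_end and r.nbases >= min_length]
    return forward, reverse


# Toy data for demonstration only; not taken from the real dataset.
toy_reads = [
    AlignedRead("fwd_ok", 1044, 25294, 640),
    AlignedRead("fwd_short", 1044, 1051, 2),
    AlignedRead("rev_ok", 7930, 27654, 700),
    AlignedRead("late_start_early_end", 3161, 25294, 650),
]
forward, reverse = split_by_direction(toy_reads, min_length=500)
print("forward set:", forward)
print("reverse set:", reverse)
```

Each set would then be analyzed on its own, since reads from opposite ends rarely cover the same region of the alignment.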
Thank you very much for trying to help me! That makes me pretty sad, because I have to work with these sequences and deal with the problems now. I do have the sff file. Hm, so you would suggest analysing both sequence types separately?