I am wondering if all commands that work with 454 pyrosequencing reads work equally well with Ion Torrent reads?
For the most part, but I would be suspicious of the sequence quality. It would be nice to have some IonTorrent mock community data to vet things like parameters for trim.seqs or to see whether we can do something like shhh.flows.
Pat
Hi,
Just curious. Is there a paper or protocol out there for use of 16S rRNA on Ion Torrent?
Thanks,
Ameet
I am trying to use trim.seqs with an Ion Torrent dataset and whatever quality settings I use, the trimmed fasta contains zero, zilch, nada sequences.
Any advice?
Csaba
When you look at your scrap.fasta file what codes do you see after the sequence names?
So what should we do to generate Torrent mock community data?
I tried trim.flows on a Torrent sequence dataset, which didn’t use barcoding, the output was strange. When I used the default settings, the generated trim.flow file contains no sequence. Then I changed the minflows to 10 and maxflows to 800. This time the trim.flows file contains some sequences but some intensity values are very strange, some values are -NaN, some value is like 3321368936448.00 …
So what should we do to generate Torrent mock community data?
Well, you’d have to create your own mock community and sequence it. I’m not aware of any datasets that are currently available out there.
I tried trim.flows on a Torrent sequence dataset, which didn’t use barcoding, the output was strange. When I used the default settings, the generated trim.flow file contains no sequence. Then I changed the minflows to 10 and maxflows to 800. This time the trim.flows file contains some sequences but some intensity values are very strange, some values are -NaN, some value is like 3321368936448.00 …
If you look at the first line of the flow file, what is that number? I’m pretty sure it isn’t 800.
So if I am gonna create a mock community and sequence it, do you have some suggestions on the composition of the community so it would be best for estimating the parameters for trim.flows and shhh.flows?
The number is actually 325. I then tried trim.seqs on another dataset, this time the number is 488.These two datasets were sequenced using the same kit on the same sequencer. So this is quite different from 454 platform, which is the same for the same type of sequencer. So is trim.seqs able to handle this kind of situation?
So if I am gonna create a mock community and sequence it, do you have some suggestions on the composition of the community so it would be best for estimating the parameters for trim.flows and shhh.flows?
You can get the HMP mock community from BEI. Alternatively, if you have old clones that you’ve sequenced, you could use those. I think if you have 25 diverse DNAs pooled evenly you should be in good shape. You might also try to pick a few closely related sequences too.
The number is actually 325. I then tried trim.seqs on another dataset, this time the number is 488.These two datasets were sequenced using the same kit on the same sequencer. So this is quite different from 454 platform, which is the same for the same type of sequencer. So is trim.seqs able to handle this kind of situation?
Yeah, they’re likely to be different. trim.seqs shouldn’t have a problem, assuming you know the correct quality score settings, which you’d get from a mock community.
Sorry for the late reply. I did not get notification of a reply.
So here is what I did.
I used sffinfo to get the qual and fasta files.
Then I do :
trimseqs(fasta=sd5.fasta, qfile=sd5.qual, minlength=50, maxhomop=8, qwindowsize=50, qwindowaverage=35, processors=10)
with version 1.25 Iget some seqeunces but not too many.
722 ends up in trim.fasta
4,200,000 ends up in scrap.fasta
If i do the simpler version of trim.seq
trimseqs(fasta=sd5.fasta, qfile=sd5.qual, minlength=50, maxhomop=8, qaverage=35, processors=10)
All sequences end up in the trash.
The reason for the rejection in the scrap file is always “q”.
Am I too stringent with the quality?
Csaba
Unless you have a mock community, it’s really hard to tell whether you’re being too stringent.
I’m pretty sure that what comes off the sequencer is qaverage=25 - so setting it to 35 will result in everything getting trashed. You could try just doing qthreshold=20 or qwindowaverage=30 and see what happens. Hopefully we should have better guidance in the next few weeks.
Pat
Hi Pat!
Thanks for the quick reply. I have tested these trimming settings:
trimseqs(fasta=sd5.fasta, qfile=sd5.qual, minlength=50, maxhomop=8, qthreshold=20, processors=10)
This resulted
2,584,078 seqs in the trash all of them with a “q” filter
1,719,834 seqs in the trim file
trimseqs(fasta=sd5.fasta, qfile=sd5.qual, minlength=50, maxhomop=8, qwindowsize=50, qwindowaverage=30, processors=10)
resulted
1,791462 seqs in the trash all of them with a “q” filter
2,512,450 seqs in the trim file
This result looks certainly better. However, as you mentioned, it is very difficult to judge which is the good setting. Let me elaborate on that a bit and hopefully provide a solution.
We are sequencing antibody genes from a huge library. We only care about the most diverse part of the library, which is the heavy chain CDR3, but that’s not too important for this discussion. What is important is that these sequences should not contain STOP codons or very few ones only.
What we noticed and is quite disturbing that if we sequence our library with 454 and IONtorrent then we have 10(!) times more stop codons in our sequences using iontorrent data. That’s not good.
So what we can do is, try different qthreshold, or qwindowaverage settings and analyze our data and see if the occurence of the STOP codons go down.
What do you think?
Csaba
That’s a great control - yikes! I would tend to trust the qthreshold=20 (or even =25). I’m suspicious of the iontorrent data output and have been hearing mixed reports of the data quality/error rates. Sorry I can’t help more. One other option would be to do keepfirst=100 and only look at the first 100 bases. If you can’t trust that then…