Pyrosequencing of functional genes...


I am fallowing the tutorial by Brian Oakley- about Pyrosequencing of functional genes…( Awesome to have it by the way, thanks!)

In this tutorial he start saying that he cleaned up the sequences ( I guess with trim.seqs), and then says that he translated the sequences in each of the 3 forward frames…How did he do that for 23714 sequences?



For analysing the amino acid coding genes I would suggest the following:

  1. Trim and quality-check the sequences following Schloss SOP.
  2. Use Bioedit ( for translation. I assume here that your sequences are sequenced from the forward direction.
    -a) Open the nucleotide fasta file and translate from the first reading frame (i.e. just translate). THen, Use Bioedit´s “filter out sequences containing” - function to delete away every amino sequence that contains * or x - residues. After this check how much there is left sequences that do not contain * and x.
    -b) Open the original fasta again in Bioedit. Delete the first nucleotide column from every sequence (remember to put Bioedit into the edit mode). Again Translate, filter and check the number of good sequences.
    -c) OPen the original fasta, delete two first nucleotides, translate, filter and check the number of good sequences.
    -d) THe correct reading frame should be the one that left you with the largest number of sequences after filtering. You may wish now to do blastP (protein-blast) search for some of the amino acid sequences to make sure. This can also be easily done with Bioedit.
  3. Open the original fasta file in Bioedit.
    a) Assuming the second reading frame to be the correct one, delete the first nucleotide and save your fasta with a new name.
  4. Run summary.seqs in mothur.
  5. Based on the size frequency of your data set determine some minimum length for your sequences to be used in following analyses. The minimun length should be dividable by three. Here you must find a compromise between maximizing the number of sequences and maximizing the lenght of your future alignment. In my own dataset I have chosen 285 (95 aminos).
  6. Run screen.seqs in Mothur with minlenght (minlength=285) parameter. Remember also to screen your group file here.
  7. Open the screened fasta file in Bioedit and cut the sequences from the end to minimum lenght of 285. Save this fasta-file.
  8. Translate the sequences and filter out sequences containing * or x. Save this protein fasta file.
  9. Make an accnos list from the protein fasta file: Export the fasta to tab-limited text using Bioedit. Open the txt.file in excel so that sequence name is in the first column and sequence is in the second column. Leave only the sequence names and save as “.accnos”-file. (check for instructions for accnos-lists in mothur wiki).
  10. In mothur: Use get.seqs to pick up the sequences in the accnos list from the nucleotide fasta which was saved in step 7 and from your group file.
  11. For the fasta produced in step 10 run unique.seqs in mothur.
  12. Open the unique seq fasta in Bioedit and translate. Save the amino acid fasta.
  13. Align the amino acid sequences in fungene aligner( If your gene is not supported by this aligner you could use ClustalOmega in Mobyle Portal (
  14. Open the alignment in Bioedit and get rid of every sequence that do not align properly.
    a) In my experience, most of the sequences are aligned nicely but a small fraction do not align at all. . If you do blastP-search for these non-aligned seqs, you will most likely notice that they are” rubbish”. Some sequences also align only partly. Delete the large gap areas from both start and the end of the nice part of the alignment. Filter out sequences that do not fit the alignment. You could use here the “number of gaps in the end or in the start” – parameter of the filter out-command.
    b) Remove all the gaps from the alignment and save the amino acid fasta.
  15. Do the alignment again for the fasta saved in 14b.
  16. Open the alignment in Bioedit and do the cleaning procedure of 14a again if it is still necessary. You might have missed something. Finally, save the alignment.
  17. Do accnos-list of your alignment as in 9.
  18. In mothur: Pick the seqs in accnos-list from the nucleotide fasta file and the group file generated in step 10, and from the names file generated in step 11. Use get.seqs with dups=true parameter. (see mothur wiki).
    19.In mothur: Run chimera.uchime using the fasta, name and group-file generated in step 18. (see mothur wiki)
    20.In mothur: Use remove.seqs to get rid of chimeric sequences from the fasta,name and group file. Remember the dups=true parameter.
  19. Now you have chimera-free nucleotide fasta file which you should again translate and then align. This is now your final alignment.
  20. Calculate distance matrice for your alignment in your favourite program : Phylip, Mega, ClustalX2… and use this distance matrice in mothur cluster command. Proceed with OTU-based analyses as in Schloss SOP.

Best regards,
Antti Rissanen
Jyväskylä University

Thank you so much Risku?!
This is such a good explanation step by step! I am sorry I thank so late- I was away …
I will give it a try now
Thanks again!

By the way, a lot has changed since a number of those tutorials were created! Feel free to update them or to add your own. People find these examples very useful.



There is this new pipeline you might want to try for protein coding sequences:

See also articles:



Antti Rissanen,
University of Jyvaskyla, Finland