It’s me again, asking weird questions.
I’m actually working with a large dataset of sequences obtained by pyrosequencing.
It’s a metagenomic project, but my boss wants me, as a side project, to analyse the 16S rRNA sequences in our samples using mothur.
I’m currently BLASTing all my sample files to determine which of the sequences in them are 16S related. There are 5000 sequences per sample.
Because of the large number of sequences, I need to write a script that reads the BLAST output files, extracts the “hit” sequences from the sample files, and creates the files required for an analysis in mothur.
So here is my question:
what files MUST be created and what do they look like?
This must seem stupid, but while I can see how to make a fasta file, I’m a bit lost about what a .group or .names file should look like. I have opened a few to get an idea, but I see no pattern…
I’m really sorry to flood your forum with my newbie’s questions.
Thanks for your help
At a minimum you need a fasta file. You only need a names file if you have redundant sequences. As for a group file, the first column is the sequence name and the second column is the group name. Try looking at some of the files generated by the SOP to get a sense of how these files are supposed to look.
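To make the two-column group file layout concrete, here is a minimal Python sketch. It assumes one FASTA file per sample, that the group name is simply the sample name, and that the two columns are separated by a tab. The sample names and sequences are made up for illustration, and the helper functions are hypothetical, not part of mothur.

```python
def read_fasta_names(fasta_text):
    """Return the sequence names (first word after '>') found in FASTA text."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            if header:  # skip a bare '>' with no name
                names.append(header[0])
    return names


def make_group_lines(samples):
    """Build group file lines: sequence name, a tab, then the group name.

    `samples` maps a group (sample) name to the FASTA text of that sample.
    """
    lines = []
    for group, fasta_text in samples.items():
        for name in read_fasta_names(fasta_text):
            lines.append(f"{name}\t{group}")
    return lines


# Made-up example data: two samples with three sequences in total.
samples = {
    "sampleA": ">seq1 some description\nACGT\n>seq2\nGGCC\n",
    "sampleB": ">seq3\nTTAA\n",
}
group_lines = make_group_lines(samples)
print("\n".join(group_lines))
```

In a real script the FASTA text would be read from the sample files and the lines written out to a single group file that matches the combined fasta file.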
Hope this helps,
Thanks for your help Pat.
Actually, I think I might need a names file, because most of the commands I want to use take a .names file as one of their inputs.
Since I don’t have redundant sequences, I imagine I just need to generate a simple list of all the sequence names.
Could you please confirm this?
I’m really thankful for your help.
I’d like to seize the moment to congratulate you on this forum. To my mind, without a community and the exchange of methods and ideas, progress means nothing.
The easiest way to generate the names file from a fasta file is with the unique.seqs command, http://www.mothur.org/wiki/Unique.seqs.
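For anyone who wants to see what such a names file contains, here is a rough Python sketch of the idea. A names file has two tab-separated columns: the name of a representative sequence, then a comma-separated list of every sequence identical to it (the representative included). With no redundant sequences, each line is simply the same name twice. The functions below are hypothetical helpers that compare sequences as exact strings; this is only an illustration and not a replacement for unique.seqs, which also writes the matching unique fasta file.

```python
def parse_fasta(fasta_text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def make_names_lines(records):
    """Group identical sequences; the first name seen is the representative."""
    by_sequence = {}
    for name, seq in records:
        by_sequence.setdefault(seq, []).append(name)
    return [f"{names[0]}\t{','.join(names)}" for names in by_sequence.values()]


# Made-up example: seq1 and seq3 are identical, seq2 is unique.
fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nACGT\n"
names_lines = make_names_lines(parse_fasta(fasta_text))
print("\n".join(names_lines))
```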
Just to add some recommendations, assuming these are whole-genome metagenomes and not 16S tag pyrosequencing.
A simple way to detect and download 16S rRNA and related sequences is to use web tools such as MG-RAST and CAMERA. Just upload your metagenome and search for the gene (i.e. the function). Once you have selected the sequences, download them as a FASTA file.
Once you have downloaded your FASTA file, you can generate the names and group files as mentioned before.
This could be an easy method for those who lack the computational hardware or the skills to write a script to "fish out" these sequences.
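For those who would rather script the "fishing out" step, a minimal Python sketch follows. It assumes the BLAST results are in tabular format (for example `-outfmt 6` in BLAST+), that the sample reads were used as the queries so the first column holds the read name, and that any filtering on e-value or identity has already been done. The function names and the example data are made up for illustration.

```python
def hit_ids(blast_tabular_text):
    """Collect the query IDs (first column) from tabular BLAST output."""
    ids = set()
    for line in blast_tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids


def extract_hits(fasta_text, wanted):
    """Return FASTA text containing only the records whose name is in `wanted`."""
    out = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


# Made-up example: read1 has two hits, read3 has one, read2 has none.
blast_text = (
    "read1\tref16S_a\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-90\t350\n"
    "read3\tref16S_b\t97.0\t180\t5\t0\t1\t180\t40\t219\t1e-80\t310\n"
    "read1\tref16S_c\t95.0\t200\t10\t0\t1\t200\t10\t209\t1e-70\t290\n"
)
sample_fasta = ">read1\nACGT\nTTGG\n>read2\nCCCC\n>read3 desc\nGGGG\n"
hits_fasta = extract_hits(sample_fasta, hit_ids(blast_text))
print(hits_fasta, end="")
```

The resulting FASTA text can then be used as the input for the names and group files.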
For metagenomes, you can expect less than 1% of your sequences to be identified as 16S rRNA or related sequences.
Important: since your sequences were generated from random amplification, these fragments will contain, in addition to any region of the 16S, other sections of the genome (for example 23S, ITS, or unrelated genes). This matters for downstream applications such as classification and alignment.
As with any web tools (e.g. MG-RAST and CAMERA), their algorithms are limited by the databases available.
Hope this helps.