I am not sure whether this is a right forum for this but maybe someone can help me.
I am struggling with 18000 sequences of 200bp-300bp long nirS gene sequences (454-pyrosequencing). My problem is how to get a good alignment from this data. Because I assume that everything there is not real nirS but some rubbish sequences I have made some prescreening using blastn against known nirS sequences. I discarded everything not affiliated to known nirS. Another prescreening was to translate nucleic acids to protein sequences and then to discard those with unknown sequences and stop codons. I have also reduced the data set to only few thousands sequences using unique.seqs command. But thereafter the problems begin.
- Muscle or ClustalW alignments produce only very uggly alignments; gaps really are not divideable by three as they should be given this is protein coding sequence. 2. I have also tried what one of the wonderful mothur-examples suggest: divide this kind of muscle or CW-alignment ,even though it is uggly, to OTUs and then make a more accurate alignment using representative sequences of the OTUs and use this alignment as a refernce alignment in mothur -aligner. The reference alignment produced like this looks better than the raw alignment but still with some uggly parts in it but it does not help a lot in the mothur-alignment. 3. I have also made a beautiful reference alignment using real nirS environmental sequences from functional gene repository but alignment of my data against this does not produce better results.
Can somebody also working with pyroseqiencing sequences of functional genes advice me what I should do. Can it be that although I made the prescreening using translation (see above) there still can be so much sequencing errors that good alignments just are not possible.
To create a reference alignment, could you align your references in amino acid space and then back translate to the original DNA sequence so your gaps are in logical places? I think MEGA does this, but I’m not sure…
Is there any reasons why mothur-aligner would not work with amino-acid sequences? I did the following for my sequences:
I had 20000 nirS-gene nucleic acid sequences. I removed the primer and barcodes and all the sequences under 200bp and with > 1 unknown bases using mothur.
Then I translated the sequences to amino acids from three reading frames using BioEdit. Using bioedit I removed sequences having unknown amino acids and stop codons.
Using amino acid sequences I made a blastp search against all the nirS-amino acid sequences deposited in Functional Gene Repository. For further analyses I took only those sequences that had a blast hit- length of at least 66aminos and whose query started at position 1.
All these screening procedures dropped the sequence number to appr 16000. After unique.seqs in mothur, the number of sequences was appr 3000.
I aligned these 3000 amino acid sequences against reference amino acid sequence in mothur. As reference I used amino acid sequences of the closest blastp hits. These amino acid sequences could be received aligned from Functional Gene Repository.
After alignment I removed all the sequences which were in the flip.accnos file. This reduced the 3000 sequence set into 2000. Thereafter I made a new alignment using mothur.
Then I ran filter.seqs command and received an alignment which looks good. Some manual adjustments must be made but still this is better than no other of my trials before.
Sorry, but align.seqs is not set up to use sequences of amino acids at this time. While I’m sure mothur would do it, I wouldn’t trust the results. I think things like kmer searching and the actual aligning would be screwed up because mothur will treat everything that isn’t an A, T/U, G or C as an ambiguous base.
Ok. Thanks. I understand. I will now try your original suggestion. I will back-translate my sequences and reference-alignment to nucleic acids and then do a mothur aligning.
Okay…I am happy for you that you’ve understand and respect of what they are saying.