mothur

How to make a reference database of nifH genes for use in Mothur

Hi, I need to analyze MiSeq sequences of the nifH gene using Mothur, so I am looking for a reference database. Please let me know how I can make one, or whether there is an existing database that I can use as a reference. Thanks

For alignment or classification? I’d encourage you to look at the “standard” reference files we provide and try to emulate those. If you run into a problem, holler and we can help you out.

Pat

Where can I get the “standard” reference files? Do you mean the Silva reference files? I think the Silva references contain 16S rRNA and 18S rRNA gene sequences, so they would be of little use for aligning nifH genes…

You need to replace those reference files with what you come up with for nifH. Please look at the silva and RDP reference files to see what they are supposed to look like.

Pat

Hi,
The Silva database file looks like this:

AF515816.1
…AAA---------------------------------------------------AG-C-C-C-AC-C-A-----A-AG-C-T--A-C--G-A------------T--AG-G--T-------
AY688433.1
…AAA---------------------------------------------------GG-C-T-C-AC-C-A-----A-GC-C-T--A-C--G-A------------T--CC-A--T-------
AJ582031.1
…AAT---------------------------------------------------GG-C-T-C-AC-C-A-----A-GG-C-G--A-C--G-A------------T--CA-G--T

So, do I need to make a database for nifH? Please elaborate on your answer. Thanks

That’s correct. Sorry, I’m not sure what step you’re stuck on. It’s a multistep process and I don’t know where you are struggling.

Pat

OK, I am following these steps; please let me know if I am correct. I will start from the first step and let you know when I get stuck.

  1. Download a set of trusted reference protein sequences. The best case would be to get these from SwissProt, although if there aren’t enough sequences there you can expand it using the rest of UniProt or NCBI.
  2. Perform a multiple sequence alignment of these sequences.
  3. Reverse translate the proteins into nucleotides.
  4. Visualise the alignment and manually refine positions if necessary.
  5. Add padding gaps into the alignment. This last step is a bit arbitrary but the basic idea is to add extra gaps between each position in the alignment space, similar to how the SILVA alignment stretches the ~1,600 16S rRNA positions over 50,000 spaces. I typically go for 10 gaps between each position in the alignment. For example, the aligned sequences
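
The padding in step 5 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `pad_alignment`, the use of "-" as the gap character, and the toy sequences are made up, and the gaps are inserted between neighbouring columns only (nothing is added before the first or after the last column).

```python
def pad_alignment(aligned, gaps_per_column=10, gap_char="-"):
    """Insert a fixed number of gap characters between every alignment column.

    `aligned` maps sequence names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("sequences are not all the same length")
    spacer = gap_char * gaps_per_column
    # str.join places the spacer between consecutive characters (columns).
    return {name: spacer.join(seq) for name, seq in aligned.items()}


# Toy input, not real nifH data; 2 gaps per column keeps the output readable.
example = {"seqA": "ATG-C", "seqB": "AT-GC"}
padded = pad_alignment(example, gaps_per_column=2)
for name, seq in padded.items():
    print(name, seq)
```

Because every sequence gets the same spacer at the same columns, the padded alignment stays rectangular, which is what the reference alignment needs.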

I downloaded ~90,000 protein sequences from NCBI, and they look like the pattern below. However, I have a concern: these sequences don’t appear to carry any taxonomic information that I could use for classification. How will I get a taxonomy file at the end?

AGV01051.1 NifH [Enterobacter sp. R4-368]
MTMRQCAIYGKGGIGKSTTTQN…
ADM52729.1 NifH [Paenibacillus sabinae]
MSKKPRQIAFYGKGGIGKSTTSQNTLAQLA…
AAF82637.1 NifH [Trichodesmium erythraeum IMS101]
MRQIAFYGKGGIGKSTTSQNTLAAMANRHG…

You will actually need DNA sequences, not amino acid sequences. You will need to extract the taxonomy information from the GenBank accession data.
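
For anyone wondering what "extract the taxonomy information" could look like in practice, here is a rough sketch, not an official mothur procedure. It assumes the records were downloaded in GenBank flat-file format, reads the lineage that follows each ORGANISM line, and builds the "name, tab, semicolon-separated ranks ending in a semicolon" layout that the mothur taxonomy files use. The function name and the record below are hypothetical, spaces inside taxon names are replaced with underscores as a precaution, and organism names that wrap onto a second line are not handled.

```python
def genbank_to_taxonomy(genbank_text):
    """Map each VERSION accession to a 'Rank1;Rank2;...;' lineage string."""
    taxonomy = {}
    for record in genbank_text.split("\n//"):
        accession = None
        lineage_lines = []
        in_organism = False
        for line in record.splitlines():
            if line.startswith("VERSION"):
                fields = line.split()
                if len(fields) > 1:
                    accession = fields[1]
            elif line.startswith(" ") and line.lstrip().startswith("ORGANISM"):
                in_organism = True
            elif in_organism:
                if line.startswith(" "):
                    # Indented lines after ORGANISM hold the lineage.
                    lineage_lines.append(line.strip())
                else:
                    in_organism = False
        if accession and lineage_lines:
            lineage = " ".join(lineage_lines).rstrip(".")
            ranks = [r.strip().replace(" ", "_")
                     for r in lineage.split(";") if r.strip()]
            taxonomy[accession] = ";".join(ranks) + ";"
    return taxonomy


# Made-up record for illustration only (not a real GenBank entry).
sample = "\n".join([
    "LOCUS       XX000001",
    "VERSION     XX000001.1",
    "SOURCE      Enterobacter sp.",
    "  ORGANISM  Enterobacter sp.",
    "            Bacteria; Pseudomonadota; Gammaproteobacteria;",
    "            Enterobacterales; Enterobacteriaceae; Enterobacter.",
    "REFERENCE   1",
    "//",
])
tax = genbank_to_taxonomy(sample)
for name, lineage in tax.items():
    print(name + "\t" + lineage)
```

The names in the first column have to match the sequence names in the reference FASTA file exactly, so it is worth using the same accession (with version) in both.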

Pat

I am following the steps suggested in my previous post, “Can we use Mothur for nifH analysis?” OK, I will download DNA sequences from GenBank.

I downloaded ~77,000 gene sequences from GenBank and aligned them with MAFFT. The output file looks like this. Please let me know the next step for making the database.

Z31716.1 Nostoc PCC 6720 nifH, nifD, nifU genes
--------------aattcctctgggcaaaaa----------------------------
---------------cgacccctcaccaacgtgcagaagattgccctcattcaaaaagta
ttagacgaagaag-----------------------------------------------
-------------------------------------------taagacccgtattgatt
gccgacggcggagatgtagaactct-----------------------------------
----------------------acgatgtagacggcgatattgtcaaagtagtac-----
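
One possible tidy-up of output like the above, before it is used as a reference alignment, is sketched here. It assumes the file is ordinary FASTA with ">" header lines (the ">" is not visible in the excerpt), keeps only the first word of each header as the sequence name, joins the wrapped lines, upper-cases the bases, and checks that every aligned sequence has the same length. The function names and the toy input are my own.

```python
def read_aligned_fasta(text):
    """Parse wrapped FASTA text into {name: sequence}, upper-casing bases."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the accession, dropping the free-text description.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def is_rectangular(seqs):
    """True when all aligned sequences share one length."""
    return len({len(s) for s in seqs.values()}) <= 1


# Toy input, not real nifH data.
raw = ">Z1.1 some description\n--at\ngc--\n>Z2.1 other description\nacgt\n----\n"
seqs = read_aligned_fasta(raw)
print(seqs, is_rectangular(seqs))
```

Note that two records sharing the same first word in the header would overwrite each other here, so duplicated accessions should be checked separately.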

######################################################
A published nifH database, http://www.css.cornell.edu/faculty/buckley/nifh.htm (Gaby and Buckley, 2014), looks like this:
#################################################

AB079619 461 bp dna AB079619
… … … … …
… … … … …
… … … … …
… … … … …
… … … … …
… … … … …
… … … … …
.CAGATCGCG TTTTACGGGA AGGGCGGAAT CGGC------ ---AAGTCCA
CCACTTCGCA GAACACGCTA GCGGCG---- -----CTGGC AGAG---ATG
GGT---CAGA AGATCCTGAT TGTAGGGTGC GATCCGAAAG CGGACTCGAC
TCGC---CTT ATT---CTGC ACGCCAAG-- -GCTCAAGAC ACGATTTTGA

Previously, Dwaite suggested the following steps. I also tried downloading nifH reference protein sequences (~67,000) from bacteria and aligning them with MAFFT, but I cannot get the back-translation (aligned protein to aligned nucleotide) to work using EMBOSS, so I cannot go on to refine the alignment. Please let me know: is it important to go through protein sequences and translate them into nucleotide sequences, or can I align the nucleotide sequences directly, then refine the alignment and add gaps?


  1. Download a set of trusted reference protein sequences. The best case would be to get these from SwissProt, although if there aren’t enough sequences there you can expand it using the rest of UniProt or NCBI.
  2. Perform a multiple sequence alignment of these sequences.
  3. Reverse translate the proteins into nucleotides.
  4. Visualise the alignment and manually refine positions if necessary.
  5. Add padding gaps into the alignment.
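
To make step 3 concrete, here is a minimal sketch of what "aligned protein to aligned nucleotide" means: each residue in the protein alignment is replaced by the next codon from that protein's own coding sequence, and each protein gap becomes three nucleotide gaps. It assumes the matching CDS is available for every protein and starts in frame; it does not check that the CDS actually translates to the protein. The function name and toy sequences are hypothetical, and dedicated tools such as EMBOSS tranalign or PAL2NAL handle the real cases more carefully.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build a codon alignment row from an aligned protein and its CDS."""
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * residues:
        raise ValueError("CDS is too short for the aligned protein")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)           # one protein gap = three nucleotide gaps
        else:
            out.append(cds[pos:pos + 3])  # consume the next codon
            pos += 3
    return "".join(out)


# Toy example: M-K aligned against a two-codon CDS.
row = thread_codons("M-K", "ATGAAA")
print(row)
```

Any nucleotides left over after the last residue (for example a stop codon) are simply ignored in this sketch.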