How to make a reference database of nifH genes to use as a reference in Mothur

sekhwal · October 1, 2017, 3:46am

Hi, I need to analyze MiSeq sequences of the nifH gene using Mothur, so I am looking for a reference database. Please let me know how I can make one, or whether there is an already available database that I can use as a reference. Thanks

pschloss · October 2, 2017, 11:59am

For alignment or classification? I’d encourage you to look at the “standard” reference files we provide and try to emulate those. If you run into a problem, holler and we can help you out.

Pat

sekhwal · October 3, 2017, 3:08am

Where can I get the “standard” reference files? Do you mean the Silva reference files? I think the Silva references contain 16S rRNA and 18S rRNA gene sequences, so they would be of little use for aligning nifH genes…

pschloss · October 5, 2017, 11:20am

You need to replace those reference files with what you come up with for nifH. Please look at the silva and RDP reference files to see what they are supposed to look like.

Pat
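
A minimal sketch of what "looking like" the silva and RDP files could mean in practice, assuming two conventions: an alignment reference is a FASTA file in which every sequence has the same aligned length, and a taxonomy file has one `name<TAB>level1;level2;...;` line per sequence, with names matching the FASTA. The function name `check_reference_pair` is hypothetical and not part of mothur. The equal-length check matters for an alignment reference, and the name matching matters for classification.

```python
def check_reference_pair(fasta_text, taxonomy_text):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    # Collect each sequence name and its total (possibly wrapped) length.
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    if len(set(lengths.values())) > 1:
        problems.append("aligned sequences are not all the same length")

    # Taxonomy lines are assumed to be: name<TAB>level1;level2;...;
    tax_names = set()
    for line in taxonomy_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2 or not fields[1].strip().endswith(";"):
            problems.append("malformed taxonomy line: %r" % line)
            continue
        tax_names.add(fields[0])

    # Every reference sequence should have a taxonomy entry.
    for missing in sorted(set(lengths) - tax_names):
        problems.append("no taxonomy for sequence %s" % missing)
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the alignment itself is biologically sensible.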

sekhwal · October 6, 2017, 4:18am

Hi,
The Silva database file looks like this:

AF515816.1
…AAA---------------------------------------------------AG-C-C-C-AC-C-A-----A-AG-C-T--A-C--G-A------------T--AG-G--T-------
AY688433.1
…AAA---------------------------------------------------GG-C-T-C-AC-C-A-----A-GC-C-T--A-C--G-A------------T--CC-A--T-------
AJ582031.1
…AAT---------------------------------------------------GG-C-T-C-AC-C-A-----A-GG-C-G--A-C--G-A------------T--CA-G--T

So, do I need to make a database for nifH? Please elaborate on your answer… Thanks

pschloss · October 9, 2017, 11:36am

That’s correct. Sorry, I’m not sure what step you’re stuck on. It’s a multistep process and I don’t know where you are struggling.

Pat

sekhwal · October 10, 2017, 9:43pm

OK, I am following these steps; please let me know if I am correct. I will start from the first step and let you know when I get stuck.

1. Download a set of trusted reference protein sequences. The best case would be to get these from SwissProt, although if there aren’t enough sequences there you can expand it using the rest of UniProt or NCBI.
2. Perform a multiple sequence alignment of these sequences.
3. Reverse translate the proteins into nucleotides.
4. Visualise the alignment and manually refine positions if necessary.
5. Add padding gaps into the alignment. This last step is a bit arbitrary, but the basic idea is to add extra gaps between each position in the alignment space, similar to how the SILVA alignment stretches the ~1,600 16S rRNA positions over 50,000 spaces. I typically go for 10 gaps between each position in the alignment. For example, the aligned sequences

sekhwal · October 12, 2017, 8:29pm

I downloaded ~90,000 protein sequences from NCBI, and they look like the pattern below. However, I have a concern: these sequences don’t appear to come with any taxonomic information that I could use for classification. How will I get a taxonomy file at the end?

AGV01051.1 NifH [Enterobacter sp. R4-368]
MTMRQCAIYGKGGIGKSTTTQN…
ADM52729.1 NifH [Paenibacillus sabinae]
MSKKPRQIAFYGKGGIGKSTTSQNTLAQLA…
AAF82637.1 NifH [Trichodesmium erythraeum IMS101]
MRQIAFYGKGGIGKSTTSQNTLAAMANRHG…

pschloss · October 13, 2017, 7:47pm

You will actually need DNA sequences, not amino acid sequences. You will need to extract the taxonomy information from the GenBank accession data.

Pat
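
A minimal sketch of pulling the lineage out of GenBank records, assuming the sequences were downloaded in GenBank flat-file format, where each record carries a `VERSION` line and an `ORGANISM` line followed by indented lineage lines, and records end with `//`. The function names are hypothetical. The output follows the `name<TAB>level1;level2;...;` layout assumed for a taxonomy file, and spaces inside taxon names are replaced with underscores as a precaution.

```python
def genbank_to_taxonomy(gb_text):
    """Return a list of (accession.version, lineage) pairs from flat-file text."""
    rows = []
    for record in gb_text.split("\n//"):
        version = None
        lineage_parts = []
        in_lineage = False
        for line in record.splitlines():
            if line.startswith("VERSION"):
                fields = line.split()
                if len(fields) > 1:
                    version = fields[1]
                in_lineage = False
            elif line.startswith("  ORGANISM"):
                # The lineage is on the indented lines that follow this one.
                in_lineage = True
            elif in_lineage and line.startswith(" " * 12):
                lineage_parts.append(line.strip())
            else:
                in_lineage = False
        if version and lineage_parts:
            # Drop the final period, then split the ';'-separated ranks.
            text = " ".join(lineage_parts).rstrip(".")
            ranks = [r.strip().replace(" ", "_") for r in text.split(";")]
            ranks = [r for r in ranks if r]
            rows.append((version, ";".join(ranks) + ";"))
    return rows


def format_taxonomy(rows):
    """Format (name, lineage) pairs as tab-separated taxonomy lines."""
    return "".join("%s\t%s\n" % row for row in rows)
```

GenBank lineages vary in depth from record to record, so the resulting levels will probably need to be checked and evened out by hand before they are used for classification.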

sekhwal · October 15, 2017, 2:46am

I am following the steps suggested in my previous post, “Can we use Mothur for nifH analysis?”. OK, I will download DNA sequences from GenBank.

sekhwal · November 3, 2017, 3:18am

I downloaded ~77,000 gene sequences from GenBank and aligned them with MAFFT. The output file looks like this; please let me know the next step for making the database.

Z31716.1 Nostoc PCC 6720 nifH, nifD, nifU genes
--------------aattcctctgggcaaaaa----------------------------
---------------cgacccctcaccaacgtgcagaagattgccctcattcaaaaagta
ttagacgaagaag-----------------------------------------------
-------------------------------------------taagacccgtattgatt
gccgacggcggagatgtagaactct-----------------------------------
----------------------acgatgtagacggcgatattgtcaaagtagtac-----

######################################################
A published nifH database (nifH Sequence Database, Buckley Lab; Gaby and Buckley, 2014) looks like this:
#################################################

AB079619 461 bp dna AB079619
… … … … …
… … … … …
… … … … …
… … … … …
… … … … …
… … … … …
… … … … …
.CAGATCGCG TTTTACGGGA AGGGCGGAAT CGGC------ ---AAGTCCA
CCACTTCGCA GAACACGCTA GCGGCG---- -----CTGGC AGAG---ATG
GGT---CAGA AGATCCTGAT TGTAGGGTGC GATCCGAAAG CGGACTCGAC
TCGC---CTT ATT---CTGC ACGCCAAG-- -GCTCAAGAC ACGATTTTGA
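
Both excerpts above are wrapped across several lines and carry long header text, whereas the SILVA excerpt earlier in the thread has a bare accession followed by the aligned sequence on one line. A minimal sketch of bringing MAFFT-style output into that shape, assuming the file is standard FASTA with `>` header lines (the `>` markers do not survive in the excerpts shown here). The function names are hypothetical, and the uppercasing is only there to match the look of the SILVA excerpt.

```python
def normalize_alignment(fasta_text):
    """Return (name, sequence) pairs with unwrapped, uppercased sequences.

    The name is the first whitespace-delimited word of each header line.
    """
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks).upper()))
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks).upper()))
    return records


def to_fasta(records):
    """Format records as two-line FASTA, refusing ragged alignments."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError("sequences are not all the same length: %s" % sorted(lengths))
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)
```

The length check is deliberately strict: if the aligned sequences differ in length, the file is not a usable alignment and the problem is better fixed upstream than papered over here.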

sekhwal · November 5, 2017, 3:40am

Previously, Dwaite suggested the following steps. I also tried downloading bacterial nifH reference protein sequences (~67,000) and aligned them with MAFFT, but I cannot get the back-translation (aligned protein to aligned nucleotide) to work using EMBOSS so that I can then refine the alignment. Please let me know: is it important to go through protein sequences and translate them into nucleotide sequences, or can I directly align the nucleotide sequences, then refine the alignment, and then add gaps?

1. Download a set of trusted reference protein sequences. The best case would be to get these from SwissProt, although if there aren’t enough sequences there you can expand it using the rest of UniProt or NCBI.
2. Perform a multiple sequence alignment of these sequences.
3. Reverse translate the proteins into nucleotides.
4. Visualise the alignment and manually refine positions if necessary.
5. Add padding gaps into the alignment.
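
The "reverse translate" and "add padding gaps" steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the procedure any particular tool uses: it assumes that each aligned protein comes with its own ungapped coding sequence (a protein alone cannot tell you which codons were actually used), and that padding means inserting a fixed number of gap characters between neighbouring alignment columns, 10 by default as suggested earlier in the thread. The function names `thread_codons` and `pad_alignment` are hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped coding sequence onto its aligned protein, codon by codon.

    Each residue consumes three nucleotides; each protein gap becomes '---'.
    Any trailing nucleotides (for example a stop codon) are ignored.
    """
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for this protein")
    out, pos = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


def pad_alignment(aligned_seq, n_gaps=10, gap="-"):
    """Insert n_gaps gap characters between every pair of adjacent columns."""
    return (gap * n_gaps).join(aligned_seq)
```

Because every sequence in an alignment has the same number of columns, padding each one the same way keeps them all the same length. Aligning the nucleotide sequences directly skips `thread_codons` entirely, at the cost of losing the codon structure that the protein alignment preserves.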
