Can we use Mothur for nifH analysis?

Hi All,
Can we do nifH analysis with mothur just like we do for 16S or 18S? If yes, what reference database should we use? Thanks

Yup, you sure can. As you gathered, the only catch is making your own nifH references. I’d suggest curating one from genome sequences. To generate the reference alignment, translate the references to amino acids, align, and then back translate to DNA to preserve the triplet structure. You might need to add padding to the alignment to handle insertions. Then you’ve got your reference.

Pat

Hi,

Could you please elaborate on the steps to make a reference database for nifH seqs?

OR

Can I use the “silva.nr_v123.align” database for alignment and classification of nifH seqs? Also, there is a database for nifH (http://www.css.cornell.edu/faculty/buckley/nifh.htm). Can it be used for analysis and classification?

Thank you,

Manoj

The silva references only have 16S rRNA gene sequences so they would be of little use for aligning nifH genes.

Making your own alignment database is a fairly simple process, although the difficulty can vary a bit by gene. My typical workflow for it is:

  1. Download a set of trusted reference protein sequences. The best case would be to get these from SwissProt, although if there aren’t enough sequences there you can expand it using the rest of UniProt or NCBI.
  2. Perform a multiple sequence alignment of these sequences.
  3. Reverse translate the proteins into nucleotides.
  4. Visualise the alignment and manually refine positions if necessary.
  5. Add padding gaps into the alignment. This last step is a bit arbitrary but the basic idea is to add extra gaps between each position in the alignment space, similar to how the SILVA alignment stretches the ~1,600 16S rRNA positions over 50,000 spaces. I typically go for 10 gaps between each position in the alignment. For example, the aligned sequences
-A-TCG--A
-A-T-GA-A

become

---------------------A---------------------T----------C----------G--------------------------------A----------
---------------------A---------------------T---------------------G----------A---------------------A----------

You can then do a typical align.seqs/screen.seqs/filter.seqs in mothur using this as the template file.
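The padding step can be sketched in a few lines of Python. This is only a minimal illustration, not a purpose-built tool: the function names are made up for this post, and it assumes the padding goes before the first column, between every pair of columns, and after the last column. The exact placement should not matter as long as every sequence in the alignment is padded the same way.

```python
def pad_sequence(aligned, n_gaps=10, gap="-"):
    """Add n_gaps gap characters before, between, and after every column."""
    pad = gap * n_gaps
    return pad + pad.join(aligned) + pad


def pad_alignment(records, n_gaps=10):
    """Pad every sequence in an alignment.

    records: dict of sequence name -> aligned sequence (hypothetical input
    structure; read your FASTA into it however you prefer).
    """
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        # Padding only makes sense if the input is already aligned.
        raise ValueError("sequences differ in length; align them first")
    return {name: pad_sequence(seq, n_gaps) for name, seq in records.items()}


if __name__ == "__main__":
    demo = {"seq1": "-A-TCG--A", "seq2": "-A-T-GA-A"}
    for name, seq in pad_alignment(demo).items():
        print(f">{name}\n{seq}")
```

With 10 gaps, a 9-column alignment comes out 109 columns wide, and the original columns sit at every 11th position.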

Hi,

  1. Will the same reference file be used for classification in the following command: “classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=xyz.align, taxonomy=xyz.tax, cutoff=80)”?

  2. How do I get the taxonomy .tax file?

Sorry if my questions are unclear.

Thanks

Yes, you can use the same alignment file for taxonomic classification. A taxonomy file is just a simple tab-delimited file of the sequence name followed by the taxonomy string.

They just look like:

Seq1 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia/Shigella
Seq2 Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Clostridium

You can have as few or as many taxonomic ranks as you like, although I think it needs to be the same for all sequences. How deep you want to go in the taxonomy really depends on the resolution of your gene.
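If you are building the taxonomy file by script, something like the following would do it. It is a rough sketch: the function names and the dict-of-lists input are invented for illustration, and the trailing semicolon is an assumption based on how the taxonomy files distributed with mothur are written, so check it against a file that works for you.

```python
def build_taxonomy_lines(lineages):
    """Turn {sequence name: [rank, rank, ...]} into tab-delimited lines.

    Raises ValueError if the sequences do not all have the same number
    of ranks.
    """
    depths = {len(ranks) for ranks in lineages.values()}
    if len(depths) > 1:
        raise ValueError(f"inconsistent number of ranks: {sorted(depths)}")
    # Assumption: each taxonomy string ends with a semicolon.
    return [f"{name}\t{';'.join(ranks)};" for name, ranks in lineages.items()]


def write_taxonomy(lineages, out_path):
    """Write the taxonomy lines to out_path, one sequence per line."""
    with open(out_path, "w") as handle:
        for line in build_taxonomy_lines(lineages):
            handle.write(line + "\n")


if __name__ == "__main__":
    demo = {
        "Seq1": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
        "Seq2": ["Bacteria", "Firmicutes", "Clostridia"],
    }
    print("\n".join(build_taxonomy_lines(demo)))
```

The sequence names have to match the names in the reference fasta file exactly.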

As a side note - as far as I’m aware the naive Bayesian classifier has only been validated on 16S data, and I’m not sure how well it will perform on other markers. I’d recommend running some test data through it before you apply it to your real data. Or use the BLAST method instead.

Hi Dwaite,

Could you suggest any tools/software for MSA? I am using MAFFT, but I can’t work out how to visualise the alignment, manually refine positions, and add padding gaps.
Also, please let me know how to reverse translate the proteins into nucleotides for a large file.

Thanks,

I use MAFFT for alignments. To refine your alignment, I would recommend Geneious or ARB. The first is easier to use, but requires a licence. ARB is free but can be a bit difficult to work with, and doesn’t run on Windows. I’m sure there are other programs out there - Sequencher lets you align sequences, but I don’t know if you can manually alter them afterwards.

For back translation, there are two tools that come as part of EMBOSS, Backtranseq and Backtranambig, which are a good starting place. Adding gaps is just something I do in Python; I don’t know of any purpose-built software for it.
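If you still have the original coding sequences for your reference proteins, an alternative to a codon-table back translation is to thread each sequence's own codons onto its aligned protein, which keeps the real nucleotides and the triplet structure. The sketch below is an illustration of that idea rather than what the EMBOSS tools do; the function name is made up, and it assumes each coding sequence is in frame and matches its protein residue for residue (it checks lengths only, not the translation).

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an unaligned coding sequence onto its aligned protein.

    Every residue takes the next codon from cds; every protein gap
    becomes three nucleotide gaps.
    """
    n_residues = len(aligned_protein) - aligned_protein.count(gap)
    # Tolerate one extra trailing codon (e.g. a stop codon) in the CDS.
    if len(cds) == 3 * (n_residues + 1):
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length is not 3 x the number of residues")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


if __name__ == "__main__":
    # Toy example: Met-Lys with one alignment gap between the residues.
    print(thread_codons("M-K", "ATGAAA"))
```

Because it works one sequence at a time, looping over a large file is not a problem.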

Could you please let me know if the following reference database can be used for nifH?
http://www.css.cornell.edu/faculty/buckley/nifh.htm

Thanks

Probably?

Yes, it’s a valid fasta file that you can use in mothur, beyond that I can’t speak to the suitability of the sequences in it - I don’t know about the expected gene length or anything.

One thing that did stand out to me is there’s little taxonomic data in those files, so although you could probably use it for alignment, you would need to build a taxonomy file for the sequences if you wanted to classify.

The fasta file that Dan’s group has posted has spaces in the sequences. You would need to either export a better version of the fasta file from ARB or ask Dan to do it for you.

For example, it has…

>AB079619          461 bp          dna          AB079619
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.CAGATCGCG TTTTACGGGA AGGGCGGAAT CGGC------ ---AAGTCCA
CCACTTCGCA GAACACGCTA GCGGCG---- -----CTGGC AGAG---ATG
etc.

It needs…

>AB079619          461 bp          dna          AB079619
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
.CAGATCGCGTTTTACGGGAAGGGCGGAATCGGC---------AAGTCCA
CCACTTCGCAGAACACGCTAGCGGCG---------CTGGCAGAG---ATG
etc
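If re-exporting from ARB isn't an option, the spaces can also be stripped with a short script. This is a minimal sketch with made-up function names; it leaves the header lines untouched and removes all whitespace from the sequence lines.

```python
def clean_fasta_lines(lines):
    """Yield FASTA lines with whitespace removed from the sequence lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield line  # header: keep as-is
        elif line.strip():
            yield "".join(line.split())  # sequence: drop spaces and tabs
        # blank lines are skipped


def clean_fasta(in_path, out_path):
    """Read in_path and write a copy without spaces in the sequences."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in clean_fasta_lines(src):
            dst.write(line + "\n")


if __name__ == "__main__":
    demo = [">AB079619\n", ".CAGATCGCG TTTTACGGGA AGGGCGGAAT\n"]
    print("\n".join(clean_fasta_lines(demo)))
```

Call `clean_fasta` with your own input and output file names to convert the whole file.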

Also, this doesn’t appear to have any taxonomic information with it that you could use to do classification if that was of interest to you.

Pat

Sorry for the late communication.

I am ready to send my DNA samples for sequencing. I plan to sequence the nifH gene with the IGK3/DVV primer set. The run would use the “smallest sequencing kit (nano v2) at 2x250 with 1M reads”.
Could you please let me know whether this technique is suitable for that target gene and primer set?

Thanks

Taking a quick look at this review, it looks like that’s going to give you an amplicon of ~394 bp. When you include your adapter and barcode sequences that might make joining the pairs a bit tight - you certainly won’t get the near-complete overlap that Pat recommends.
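To put a rough number on the overlap, here is the arithmetic as a small Python function. It is a back-of-the-envelope sketch with a made-up function name: it takes the nominal read length at face value and ignores any adapter or barcode bases and any quality trimming, all of which shrink the usable overlap further.

```python
def expected_overlap(read_length, amplicon_length):
    """Number of amplicon bases covered by both reads of a pair.

    A negative value means the two reads do not reach each other.
    """
    return min(amplicon_length, 2 * read_length - amplicon_length)


if __name__ == "__main__":
    overlap = expected_overlap(250, 394)
    print(overlap)  # 106 bases covered by both reads
```

So for a ~394 bp amplicon at 2x250, only about 106 bp in the middle would be read in both directions, well short of full overlap.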

Yup, I have gone through that review. Yes, IGK3 is at positions 19-47 and DVV at positions 388-413. Could you please suggest an appropriate primer set and technique for sequencing nifH? Thanks

I can’t comment on the value of the primers in terms of target specificity or species coverage - I have no experience with the system. If you know of a shorter primer pair that gives results as good as the IGK3/DVV set then that would be great.

If you can’t, and you can’t go for longer MiSeq reads, the easiest option would be to just analyse one read direction. For example, discard your reverse reads and just analyse the forwards. That would still give you about 200 - 250 bp of a 400 bp gene region, so it’s not too bad.

If that doesn’t suit, you could always just try merging the reads. If you have some control data you could evaluate the error rate over the join region and determine if it’s of sufficient quality to proceed.

A third option - and this is completely suggested off the top of my head, I have no idea how valid it is - would be to use your reference alignment to align the forward/reverse reads, and then collapse them into a single aligned sequence. This would give you a mostly-complete sequence with a gap in the middle where the poor quality regions were trimmed and would be analogous to Craig Herbold’s approach of padding the internal gap with poly-Ns. While this approach will give you a significant gap in the middle of your amplicon, I’d argue that this is no worse than what you get from the first option (single-end only). I’m not familiar with the gene though, that gap might be important and discarding it might leave you with biologically meaningless OTUs.
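For what it's worth, the collapsing step of that third option could look something like the sketch below. It is just as untested as the idea itself. The function name is made up, and it assumes both reads have already been aligned against the same reference (so they have the same length), that the reverse read was reverse-complemented before alignment, and that `-` and `.` both mean "no base here". Disagreements between the two reads are written as N.

```python
def collapse_pair(fwd, rev, gap_chars="-."):
    """Collapse two reads aligned to the same reference into one sequence."""
    if len(fwd) != len(rev):
        raise ValueError("reads must be aligned to the same reference")
    merged = []
    for f, r in zip(fwd, rev):
        f_gap, r_gap = f in gap_chars, r in gap_chars
        if f_gap and r_gap:
            merged.append("-")  # neither read covers this column
        elif f_gap:
            merged.append(r)    # only the reverse read has a base
        elif r_gap:
            merged.append(f)    # only the forward read has a base
        elif f.upper() == r.upper():
            merged.append(f)    # both reads agree
        else:
            merged.append("N")  # both reads have a base but disagree
    return "".join(merged)


if __name__ == "__main__":
    print(collapse_pair("ACGT------", "------TTGA"))
```

Note that this cannot tell a genuine alignment gap inside one read from a column the read simply doesn't cover, so any overlapping region would need a closer look.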

Hi, so I am planning to go with long-read sequencing on the MiSeq platform, which delivers at least 1M PE300 reads, meaning inserts of up to 520 bp. I think that would be good for nifH with the IGK3/DVV primer set. Let me know if I am not correct. Thanks

According to my FAS the issue with the v3 quality dropping has been fixed but I haven’t tested that. Be prepared for very poor quality after ~150-200 bp on each read. Try trimming your nif database to those lengths and see if you’ll be able to detect the differences you want to detect.

I’m not sure how you would get 520 nt inserts. We have repeatedly found and shown that unless the reads fully overlap you will be getting high error rates. I think you should really only expect to get good data from sequencing 300nt inserts.

Yes, it may be good to go with MiSeq 250PE 10M. About 330,000 reads per sample should give good coverage. The selected IGK3/DVV primer set may perform well.