Why use MSA for the reference sequences?

dfilippo · November 5, 2015, 10:19pm

I am using mothur to analyse a set of 16S rRNA sequences. Mothur takes about 5 hours on my modest dataset of 12K sequences (they are 1.5 kbp long, but that’s not the culprit). I looked at the logs, and mothur spends most of its time (35-45 min) reading the reference 16S sequences and aligning to them (2 h for 6K sequences, 8 min for 200 seqs). I am using an RDP reference, which turns out to be 69 GB. Peeking into the reference fasta file, I see that these are not just sequences; it looks like a multiple sequence alignment of all of them. If I remove the gaps from the reference seqs, the file size drops to 3.6 GB. What is the reason for storing an MSA of the reference 16S rRNAs? Can I just use the de-gapped reference sequences to align my reads?
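For anyone who wants to run the same check on their own reference, here is a minimal Python sketch that streams through a FASTA file and reports the number of sequences, the alignment width(s), and how many characters are gaps versus bases. The function name `fasta_stats` is made up for this illustration, and treating both `.` and `-` as gap characters is an assumption about how the reference is written.

```python
from collections import namedtuple

# Hypothetical helper for illustration; not part of mothur.
FastaStats = namedtuple("FastaStats", "n_seqs widths gap_chars base_chars")


def fasta_stats(lines, gap_chars=".-"):
    """Summarise a (possibly aligned) FASTA given as an iterable of lines."""
    n_seqs = 0
    widths = set()      # distinct sequence lengths; one value if truly aligned
    gaps = bases = 0
    current = 0         # length of the record currently being read
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if n_seqs:
                widths.add(current)
            n_seqs += 1
            current = 0
        elif n_seqs:    # ignore any text before the first header
            g = sum(line.count(ch) for ch in gap_chars)
            gaps += g
            bases += len(line) - g
            current += len(line)
    if n_seqs:
        widths.add(current)
    return FastaStats(n_seqs, widths, gaps, bases)


# Tiny in-memory example; for a real file, pass an open file handle instead.
demo = [">a", "A-C.", "G--T", ">b", "..ACGT--"]
stats = fasta_stats(demo)
print(stats.n_seqs, sorted(stats.widths), stats.gap_chars, stats.base_chars)
```

Because it reads line by line, the whole file never has to fit in memory. A single value in `widths` means every record has the same length, which is what you would expect from an alignment.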

Thanks

pschloss · November 9, 2015, 3:34pm

Hi,

A couple of things…

Your reference must be aligned so that you can align your sequences to it. This allows you to incorporate the secondary structure of the reference sequences into your aligned sequences.
I wonder how many sequences are in your reference. The time required to read in the reference will depend on how many sequences are there. I suspect you have a ton.
The RDP reference kinda sucks. It doesn’t actually align the variable regions. When you look at the references you’ll find bases in lower case. These are bases that they couldn’t align. In contrast, we strongly encourage people to use the SILVA reference alignment. It is available on the mothur wiki. Also, you can find a paper describing this here:

ncbi.nlm.nih.gov

A high-throughput DNA sequence aligner for microbial ecology studies.

PD Schloss, PLoS ONE, 14 Dec 2009

As the scope of microbial surveys expands with the parallel growth in sequencing capacity, a significant bottleneck in data analysis is the ability to generate a biologically meaningful multiple sequence alignment. The most commonly used aligners have varying alignment quality and speed, tend to depend on a specific reference alignment, or lack a complete description of the underlying algorithm. The purpose of this study was to create and validate an aligner with the goal of quickly generating a high quality alignment and having the flexibility to use any reference alignment. Using the simple nearest alignment space termination algorithm, the resulting aligner operates in linear time, requires a small memory footprint, and generates a high quality alignment. In addition, the alignments generated for variable regions were of as high a quality as the alignment of full-length sequences. As implemented, the method was able to align 18 full-length 16S rRNA gene sequences and 58 V2 region sequences per second to the 50,000-column SILVA reference alignment. Most importantly, the resulting alignments were of a quality equal to SILVA-generated alignments. The aligner described in this study will enable scientists to rapidly generate robust multiple sequences alignments that are implicitly based upon the predicted secondary structure of the 16S rRNA molecule. Furthermore, because the implementation is not connected to a specific database it is easy to generalize the method to reference alignments for any DNA sequence.

ncbi.nlm.nih.gov

The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies.

PD Schloss, PLoS Computational Biology, 8 Jul 2010

Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of beta-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.
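To make the first point concrete (why the reference itself has to be aligned), here is a toy Python sketch. It is not mothur's code and it is much simpler than the aligner described in the first paper: it assumes you already have a pairwise alignment of a query against the de-gapped template, and it refuses to handle insertions in the query. The function name `project_onto_reference` and the example strings are made up. The only point is that the gapped template tells you which alignment column each base belongs in, and a de-gapped reference has thrown that information away.

```python
def project_onto_reference(aligned_template, template_row, query_row, gap_chars=".-"):
    """Place query bases into the columns of a gapped reference sequence.

    aligned_template : the template as it appears in the reference alignment
    template_row     : the de-gapped template as it appears in a pairwise
                       alignment with the query
    query_row        : the query row of that pairwise alignment ('-' = deletion)

    Simplifying assumption: the query has no insertions relative to the
    template, so template_row contains no gaps.
    """
    if len(template_row) != len(query_row):
        raise ValueError("pairwise alignment rows must have equal length")
    if any(c in gap_chars for c in template_row):
        raise ValueError("query insertions are not handled in this sketch")

    # Columns of the reference alignment that hold a template base.
    base_columns = [i for i, c in enumerate(aligned_template) if c not in gap_chars]
    if len(base_columns) != len(template_row):
        raise ValueError("template_row does not match the de-gapped template")

    projected = ["-"] * len(aligned_template)
    for column, query_base in zip(base_columns, query_row):
        projected[column] = query_base
    return "".join(projected)


# The template occupies columns 0, 2, 3 and 6 of a 7-column alignment.
print(project_onto_reference("A-CG--T", "ACGT", "AC-T"))
```

Every query projected this way comes out with the same number of columns as the reference, so the queries end up aligned to each other without ever being compared to each other directly.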

Good luck,
Pat
