I am using Mothur to analyse a set of 16s rRNA sequences. Mothur takes about 5 hours on my modest dataset with 12K sequences (they are 1.5kbp long, but that’s not the culprit). I looked at the logs and mothur spends the most time (35-45min) reading the reference 16s sequences and aligning to them (2h for 6K sequences, 8min for 200 seqs). I am using an RDP references which turns out to be 69Gb. Peeking into the reference fasta file, I see that these are not just sequences, it looks like a multiple sequence alignment of all of them. If I remove the gaps from the reference seqs, the file size drops to 3.6Gb. What is the reason for storing MSA of the reference 16s rRNAs? Can I just use the de-gapped reference sequences to align my reads?
Your reference must be aligned so that you can align your sequences to it. This allows you to incorporate the secondary structure of the reference sequences into your aligned sequences.
I wonder how many sequences are in your reference. The time required to read in the reference will depend on how many sequences are there. I suspect you have a ton.
The RDP reference kinda sucks. It doesn’t actually align the variable regions. When you look at the references you’ll find bases in lower case. These are bases that they couldn’t align. In contrast, we strongly encourage people to use the SILVA reference alignment. It is available on the mothur wiki. Also, you can find a paper describing this here: