Running SOP with a single fasta file containing multiple pre-processed sequences

I’m brand new to mothur, and I want to perform a sequence alignment and then build a tree following the mothur SOP. However, my starting point is a single fasta file containing ~40,000 unique sequences that were pre-processed in DADA2. What is the best “entry point” in the SOP given the current format of my data? Do I need to split the file up into multiple fasta files? Currently, the summary.seqs command yields the following output:
mothur > summary.seqs(fasta = buffalo_seqs.fasta)

Using 1 processors.

                Start        End       NBases    Ambigs  Polymer  NumSeqs
Minimum:        0            0         0         0       1        1
2.5%-tile:      0            0         0         0       1        1103
25%-tile:       0            0         0         0       1        11026
Median:         0            0         0         0       1        22052
75%-tile:       0            0         0         0       1        33078
97.5%-tile:     0            0         0         0       1        43001
Maximum:        1            18030465  18030465  0       8        44103
Mean:           2.26742e-05  408.826   408.826   0       1.00016
# of Seqs:      44103

Output File Names:

It took 2 secs to summarize 44103 sequences.

So it looks like mothur is interpreting all the bases in the file as a single sequence rather than splitting them up into the individual sequences.

Dear Claire,

If an alignment and a phylogenetic tree of your sequences are all you want, then there are better ways and software out there than mothur (sorry Pat & co! 🙂).

However, if you want/need to use mothur then I think you should just start with

mothur > align.seqs(fasta=buffalo_seqs.fasta, reference=silva.seed_v132.align)

and move on from there. As your sequences are already processed and unique, I guess you can skip the later unique.seqs and pre.cluster commands, as well as chimera.vsearch if DADA2 already ran a chimera check. classify.seqs is also optional, depending on whether you want a taxonomic classification or not. To create the tree, you can run the clearcut command in mothur on the aligned file.

The output after the summary.seqs command does look odd though. Have you opened the file in a text editor and checked that the sequences are in the correct fasta format?
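
If opening an 18 MB file in a text editor is awkward, the same check can be scripted. The following is a minimal Python sketch, not part of mothur or DADA2; the function name fasta_report is made up for illustration. It counts header lines, the number of bases under each header, and the line-ending style, which are the things most likely to differ between a well-formed fasta file and one that mothur reads as a single huge sequence.

```python
def fasta_report(data: bytes) -> dict:
    """Summarize the layout of a FASTA file given its raw bytes.

    Hypothetical helper for illustration; not part of mothur or DADA2.
    """
    # Count line-ending styles on the raw bytes, before splitting.
    crlf = data.count(b"\r\n")
    endings = {
        "LF": data.count(b"\n") - crlf,    # Unix-style
        "CRLF": crlf,                      # Windows-style
        "CR": data.count(b"\r") - crlf,    # bare carriage returns
    }

    lengths = []       # bases per record, in file order
    stray_lines = 0    # sequence lines that appear before any header

    # bytes.splitlines() breaks on \n, \r\n and a bare \r.
    for raw in data.splitlines():
        line = raw.strip()
        if not line:
            continue                       # ignore blank lines
        if line.startswith(b">"):
            lengths.append(0)              # a header starts a new record
        elif lengths:
            lengths[-1] += len(line)       # sequence line of current record
        else:
            stray_lines += 1

    return {
        "records": len(lengths),
        "empty_records": sum(1 for n in lengths if n == 0),
        "shortest": min(lengths) if lengths else 0,
        "longest": max(lengths) if lengths else 0,
        "stray_lines": stray_lines,
        "line_endings": endings,
    }
```

To run it on the real file, read the file in binary mode ("rb") and pass the bytes to fasta_report. A report with thousands of empty records and one very long record would be consistent with the summary.seqs output above, for example if every header line was written before any of the sequence lines. A report with only bare CR line endings would point to a line-ending problem instead. Either way, this only describes the file; it does not say how mothur itself parses it.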

Good luck,


I would strongly encourage you to get the fastq files, regardless of whether you use DADA2, mothur, QIIME, or whatever. Get the raw data. When you go to publish, you’ll need those raw files to be in the SRA. That being said, I would definitely go back to the raw data and start at the top of the SOP. DADA2 does stuff (e.g. removing singletons) that will screw up the frequency distributions.

I wonder if you actually have a fasta file, since you seem to have one very long sequence and a bunch of sequences without any data in them. You can always forward your files to and we can try to help you navigate the process of bringing the files into mothur. Basically, you’d start with align.seqs and probably skip pre.cluster and chimera checking, since those two steps are essentially what DADA2 already does.

But really, I’d go back to the raw fastq files and start over.
