Running SOP with a single fasta file containing multiple pre-processed sequences

clairecouch · February 29, 2020, 8:14pm

Hi,
I’m brand new to mothur, and I want to perform a sequence alignment and then build a tree following the mothur SOP. However, my starting point is a single fasta file containing ~40,000 unique sequences that were pre-processed in dada2. What is the best “entry point” in the SOP given the current format of my data? Do I need to split the file into multiple fasta files? Currently, the summary.seqs command yields the following output:
mothur > summary.seqs(fasta = buffalo_seqs.fasta)

Using 1 processors.

                Start        End       NBases    Ambigs  Polymer  NumSeqs
Minimum:        0            0         0         0       1        1
2.5%-tile:      0            0         0         0       1        1103
25%-tile:       0            0         0         0       1        11026
Median:         0            0         0         0       1        22052
75%-tile:       0            0         0         0       1        33078
97.5%-tile:     0            0         0         0       1        43001
Maximum:        1            18030465  18030465  0       8        44103
Mean:           2.26742e-05  408.826   408.826   0       1.00016
# of Seqs:      44103

Output File Names:
buffalo_seqs.summary

It took 2 secs to summarize 44103 sequences.

So it looks like mothur is interpreting all the bases in the file as a single sequence rather than splitting them up into the individual sequences.

Rene · March 2, 2020, 2:16pm

Dear Claire,

If an alignment and a phylogenetic tree of your sequences are all you want, then there are better tools out there for that than mothur (sorry Pat & co!).

However, if you want/need to use mothur then I think you should just start with

mothur > align.seqs(fasta=buffalo_seqs.fasta, reference=silva.seed_v132.align)

and move on from there. As your sequences are already processed and unique, I guess you can skip the later unique.seqs and pre.cluster commands, as well as chimera.vsearch if a chimera check was already done by DADA2. classify.seqs is also optional, depending on whether you want a taxonomic classification. To create the tree, you can use the clearcut command in mothur with the align file.

The output after the summary.seqs command does look odd though. Have you opened the file in a text editor and checked that the sequences are in the correct fasta format?
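If opening a file with ~40,000 sequences in a text editor is awkward, the same check can be scripted. The following is only a minimal sketch in plain Python, not part of mothur: `parse_fasta` and `check_fasta` are hypothetical helper names, and the script assumes a plain-text file with `>` header lines. It reports how many records it finds, how many are empty, the shortest and longest sequence, and which line endings the file uses. It does not tell you why mothur reads the file the way it does; it only shows whether the file looks like regular fasta.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    # str.splitlines() splits on \n, \r\n and lone \r alike
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_fasta(text):
    """Return basic sanity statistics for FASTA-formatted text."""
    lengths = [len(seq) for _, seq in parse_fasta(text)]
    crlf = text.count("\r\n")
    return {
        "records": len(lengths),
        "empty": sum(1 for n in lengths if n == 0),
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        # counts of each line-ending style found in the raw text
        "crlf": crlf,
        "cr_only": text.count("\r") - crlf,
        "lf_only": text.count("\n") - crlf,
    }


# Small in-memory example: three records, one of them empty
demo = ">seq1\nACGT\nACGT\n>seq2\n\n>seq3\nAC\n"
report = check_fasta(demo)
print(report)
```

To run it on the real file, read the text with `open("buffalo_seqs.fasta", newline="").read()` so that Python does not translate the line endings, then pass it to `check_fasta`. Many empty records, one very long record, or unexpected line endings would all be worth fixing before align.seqs.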

Good luck,

René

pschloss · March 2, 2020, 7:57pm

I would strongly encourage you to get the fastq files. Regardless of whether you use dada, mothur, qiime, whatever, get the raw data. When you go to publish, you’ll need those raw files to be in the SRA. That being said, I would definitely go back to the raw data and start at the top of the SOP. dada does stuff (e.g. removing singletons) that will screw up the frequency distributions.

I wonder if you actually have a fasta file, since you seem to have one very long sequence and a bunch of sequences without any data in them. You can always forward your files to mothur.bugs@gmail.com and we can try to help you navigate the process of bringing the files into mothur. Basically, you’d start with align.seqs and probably skip pre.cluster and chimera checking, since those two steps are essentially what dada already does.

But really, I’d go back to the raw fastq files and start over.
