Training a V4 database from a reference

gollison · February 10, 2020, 11:18pm

I trimmed a full-length ‘seed’ reference to the V4 region, following the blog post on customization. The commands were:

Short perl script (yes, perl…) to pull the V4 region from one full-length SSU Acantharia sequence. Output as Acantharia_V4.fasta.
align.seqs(fasta=Acantharia_V4.fasta, reference=silva.seed_v132.align)
summary.seqs(fasta=Acantharia_V4.align)
pcr.seqs(fasta=silva.seed_v132.align, start=6388, end=13861, keepdots=FALSE)
Rename and format using bash (sed)

After trimming to the V4 region, none of my reads aligned to the trimmed V4 reference. I then tried aligning to the full-length seed reference and got the same result: no alignments. However, when I assign taxonomy using the regular full-length silva database, things work just fine.

Any troubleshooting advice? My end goal is to produce a V4 reference for taxonomic assignment, preferably from a database that comes with a taxonomy .txt file, since both are required for taxonomic assignment.

Salute,
G

pschloss · February 11, 2020, 4:25pm

What is the output from running summary.seqs(fasta=Acantharia_V4.align)?

gollison · February 12, 2020, 10:50pm

seqname start end nbases ambigs polymer numSeqs
KC172865.1 14291 22553 408 0 5 1

I used the correct coordinates in pcr.seqs:
mothur > pcr.seqs(fasta=silva.seed_v132.align, start=14291, end=22553, keepdots=False)
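For reference, the start and end columns of the .summary file are the alignment coordinates that get passed to pcr.seqs. A minimal Python sketch of that step, assuming a whitespace-delimited summary file with a header row like the one quoted above. The helper name `pcr_command_from_summary` is hypothetical and not part of mothur, and taking the widest span across sequences is just one possible choice.

```python
def pcr_command_from_summary(summary_text, reference="silva.seed_v132.align"):
    """Build a pcr.seqs command string from the text of a mothur .summary file.

    Hypothetical helper (not part of mothur). Assumes a header row followed
    by one row per sequence, with columns named 'start' and 'end'.
    """
    rows = [line.split() for line in summary_text.strip().splitlines()]
    header, records = rows[0], rows[1:]
    start_col = header.index("start")
    end_col = header.index("end")
    # Use the widest span covered by any sequence in the summary.
    start = min(int(r[start_col]) for r in records)
    end = max(int(r[end_col]) for r in records)
    return f"pcr.seqs(fasta={reference}, start={start}, end={end}, keepdots=F)"


example = """seqname start end nbases ambigs polymer numSeqs
KC172865.1 14291 22553 408 0 5 1"""
print(pcr_command_from_summary(example))
```

With a single sequence in the summary, as here, the span is simply that sequence's own start and end.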

This produced a file called silva.seed_v132.pcr.align, which contained a lot of dots and dashes, plus the number 100 between the ID and the taxonomy. I removed these with bash and renamed the file at the same time.

AY217654.EscSeneg 100 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Kosakonia;
...C--G---------AA, etc

sed 's/-//g' Test.pcrfile.fasta | sed 's/..//g' | sed 's/.CG/CG/g' | sed 's/ 100 //g' > V4_silva.fasta
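Note that a sed pipeline of this shape applies every substitution to header lines as well as sequence lines, and that an unescaped `.` in a sed pattern, as rendered here, matches any character rather than a literal dot. A minimal Python sketch of stripping gap characters from sequence lines only, assuming headers laid out as in the example above (name, the number 100, then taxonomy). The function name `degap_fasta` is hypothetical. mothur's own degap.seqs command is another option for removing gaps.

```python
def degap_fasta(text):
    """Strip '.' and '-' from sequence lines and keep only the name on headers.

    Hypothetical helper. Header lines are assumed to look like
    '>name<whitespace>100<whitespace>taxonomy'; only '>name' is kept.
    """
    records = []  # list of (header, degapped sequence)
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # First whitespace-delimited field is the sequence name, left untouched.
            name, chunks = line.split()[0], []
        elif line:
            chunks.append(line.replace(".", "").replace("-", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    # Drop references that have no bases left in the trimmed region.
    return "".join(f"{n}\n{s}\n" for n, s in records if s)


aligned = (
    ">AY217654.EscSeneg\t100\tBacteria;Proteobacteria;\n"
    "...C--G---------AA..\n"
    ">empty.ref\t100\tBacteria;\n"
    ".....-----\n"
)
print(degap_fasta(aligned), end="")
```

The dot inside the sequence name survives because header lines are never passed through the gap-stripping step.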

Then I attempted to classify taxonomy using this V4_silva.fasta and the silva.seed_v132.tax file.

Everything was ‘Unknown’ using the V4_silva.fasta
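Since the reference fasta and the taxonomy file are matched by sequence name, one thing that could be checked is whether the names still agree after the sed edits. A minimal Python sketch, assuming a two-column taxonomy file (name, then taxonomy string). The function name `compare_names` and the sample strings are hypothetical and only illustrate the check.

```python
def compare_names(fasta_text, taxonomy_text):
    """Return (names only in the fasta, names only in the taxonomy), both sorted.

    Hypothetical helper. The name is taken as the first whitespace-delimited
    field of each fasta header and of each taxonomy line.
    """
    fasta_names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                fasta_names.add(fields[0])
    tax_names = set()
    for line in taxonomy_text.splitlines():
        fields = line.split()
        if fields:
            tax_names.add(fields[0])
    return sorted(fasta_names - tax_names), sorted(tax_names - fasta_names)


# Illustrative input: the fasta name has lost its dot, the taxonomy name has not.
fasta = ">AY217654EscSeneg\nCGAA\n"
taxonomy = "AY217654.EscSeneg\tBacteria;Proteobacteria;\n"
only_fasta, only_tax = compare_names(fasta, taxonomy)
print("only in fasta:", only_fasta)
print("only in taxonomy:", only_tax)
```

Two empty lists would mean the names agree, so any mismatch would have to come from somewhere else.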

pschloss · February 13, 2020, 1:26pm

What’s the output to the screen from running summary.seqs?

gollison · February 14, 2020, 2:21am

Using 128 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 14291 22553 408 0 5 1
2.5%-tile: 14291 22553 408 0 5 1
25%-tile: 14291 22553 408 0 5 1
Median: 14291 22553 408 0 5 1
75%-tile: 14291 22553 408 0 5 1
97.5%-tile: 14291 22553 408 0 5 1
Maximum: 14291 22553 408 0 5 1
Mean: 14291 22553 408 0 5
# of Seqs: 1

It took 0 secs to summarize 1 sequences.

Output File Names:
Acantharia_V4.summary

pschloss · February 14, 2020, 1:28pm

Thanks - a few questions…

Are you positive that you have V4 sequences?
What happens when you classify the Acantharia_V4.fasta file against your trimmed version of the silva database?
What’s the closest Acantharia sequence in the silva_seed database?

My concern is that you either don’t really have V4 sequences or that the database has poor representation of the Acantharia. You might try to repeat the pcr.seqs and following steps using the silva.nr_v132, which will have more sequences in it.
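How well a group is represented in a reference could be gauged roughly by counting matching lineages in its taxonomy file. A minimal Python sketch, assuming a two-column taxonomy file and that the group name appears as one semicolon-delimited rank. The function name `count_taxon` is hypothetical, and the sample lineages are made up for illustration, not taken from the SILVA files.

```python
def count_taxon(taxonomy_text, taxon):
    """Count taxonomy lines whose lineage contains `taxon` as an exact rank name.

    Hypothetical helper. Each line is assumed to be 'name<whitespace>lineage',
    with ranks in the lineage separated by ';'.
    """
    count = 0
    for line in taxonomy_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ranks = [r.strip() for r in fields[1].split(";") if r.strip()]
        if taxon in ranks:
            count += 1
    return count


# Made-up lineages, for illustration only.
sample = (
    "seqA\tEukaryota;SAR;Rhizaria;Retaria;Acantharia;\n"
    "seqB\tBacteria;Proteobacteria;Gammaproteobacteria;\n"
)
print(count_taxon(sample, "Acantharia"))
```

Running the same count on the seed and nr taxonomy files would show whether the larger database actually adds sequences for the group of interest.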

