Training a V4 database from a reference

I trimmed a full-length ‘seed’ reference to the V4 region, following the blog post on customization. The commands were:

  1. A short Perl script (yes, Perl…) to pull the V4 region from one full-length SSU Acantharia sequence. Output: Acantharia_V4.fasta.
  2. align.seqs(fasta=Acantharia_V4.fasta, reference=silva.seed_v132.align)
  3. summary.seqs(fasta=Acantharia_V4.align)
  4. pcr.seqs(fasta=silva.seed_v132.align, start=6388, end=13861, keepdots=FALSE)
  5. Rename and format using bash (sed)

After trimming to the V4 region, none of my reads aligned to the trimmed V4 reference. I then tried aligning to the full-length seed and got the same result: no alignments. However, when I assign taxonomy using the regular full-length silva database, things work just fine.

Any troubleshooting advice? My end goal is to produce a V4 reference for taxonomic assignment, preferably from a database that comes with a taxonomy .txt file, since both are required for taxonomic assignment.


What is the output from running summary.seqs(fasta=Acantharia_V4.align)?

seqname start end nbases ambigs polymer numSeqs
KC172865.1 14291 22553 408 0 5 1

I used the correct coordinates in pcr.seqs:
mothur > pcr.seqs(fasta=silva.seed_v132.align, start=14291, end=22553, keepdots=False)
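To avoid copying the coordinates by hand, the start and end columns can be read from the summary file. This is a minimal sketch, assuming the summary file is whitespace-delimited with the header shown above (seqname, start, end, ...) and that the first data row is the sequence of interest. The function name `pcr_command_from_summary` is made up for illustration.

```shell
# Print a pcr.seqs command built from the first data row of a mothur
# summary file. Assumes columns: seqname start end nbases ambigs polymer numSeqs.
# pcr_command_from_summary is a hypothetical helper, not part of mothur.
pcr_command_from_summary() {
  local summary="$1" reference="$2"
  awk -v ref="$reference" 'NR == 2 {
    # $2 is the start column, $3 is the end column
    printf "pcr.seqs(fasta=%s, start=%s, end=%s, keepdots=F)\n", ref, $2, $3
  }' "$summary"
}

# Example (file names assumed from the thread):
#   pcr_command_from_summary Acantharia_V4.summary silva.seed_v132.align
```

The printed line can then be pasted at the mothur prompt, which keeps the coordinates in pcr.seqs tied to whatever summary.seqs actually reported.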

This produced a file called silva.seed_v132.pcr.align. It contained a lot of dots and dashes, plus the number 100 between the ID and the taxonomy, all of which I removed using bash, renaming the file at the same time.

AY217654.EscSeneg 100 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Kosakonia;
...C--G---------AA, etc

sed 's/-//g' Test.pcrfile.fasta | sed 's/\.\.//g' | sed 's/\.CG/CG/g' | sed 's/ 100 //g' > V4_silva.fasta
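Note that a sed pipeline like the one above runs on every line, headers included, so `s/-//g` also strips hyphens from sequence IDs and taxon names. Here is a hedged sketch of an alternative that edits only the sequence lines and writes the taxonomy to its own file. It assumes each header has the form `>ID 100 taxonomy`, separated by tabs or spaces, and that the taxonomy string itself contains no whitespace. The function name `split_pcr_align` and the output file names are made up for illustration.

```shell
# Split a trimmed mothur alignment into a degapped FASTA file and a
# two-column taxonomy file (ID <tab> taxonomy).
# Assumes headers like ">ID<ws>100<ws>taxonomy" with no whitespace inside
# the taxonomy string. split_pcr_align is a hypothetical helper.
split_pcr_align() {
  local src="$1" fasta_out="$2" tax_out="$3"
  awk -v tax="$tax_out" '
    /^>/ {
      id = substr($1, 2)          # first field without the leading ">"
      print id "\t" $NF > tax     # last field is taken to be the taxonomy
      print ">" id                # FASTA header keeps only the ID
      next
    }
    {
      gsub(/[.-]/, "")            # remove gap characters from sequence lines only
      print
    }
  ' "$src" > "$fasta_out"
}

# Example (output names are placeholders):
#   split_pcr_align silva.seed_v132.pcr.align V4_silva.fasta V4_silva.tax
```

Keeping the ID as the first whitespace-delimited field means the FASTA headers and the first column of the taxonomy file stay identical, so the two files can be matched up by ID.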

Then I attempted to classify taxonomy using this V4_silva.fasta and the taxonomy file.

Everything was ‘Unknown’ using the V4_silva.fasta.
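One thing worth ruling out (a possibility, not something confirmed in this thread) is that the IDs in the edited FASTA headers no longer match the IDs in the taxonomy file. The sketch below lists FASTA IDs that have no entry in the taxonomy file. It assumes the taxonomy file has the ID in its first whitespace-delimited column and is not empty. The function name `ids_missing_from_taxonomy` is made up for illustration.

```shell
# Print every FASTA ID that does not appear in the first column of the
# taxonomy file. Assumes a non-empty taxonomy file with one "ID taxonomy"
# pair per line. ids_missing_from_taxonomy is a hypothetical helper.
ids_missing_from_taxonomy() {
  local fasta="$1" tax="$2"
  awk '
    NR == FNR { known[$1] = 1; next }   # first file: remember taxonomy IDs
    /^>/ {
      id = substr($1, 2)                # header ID without the leading ">"
      if (!(id in known)) print id
    }
  ' "$tax" "$fasta"
}

# Example (taxonomy file name is a placeholder):
#   ids_missing_from_taxonomy V4_silva.fasta silva.seed_v132.tax
```

No output means every reference sequence has a taxonomy entry. Any IDs that are printed point at headers that were altered during cleanup.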

What’s the output to the screen from running summary.seqs?

Using 128 processors.

            Start  End    NBases  Ambigs  Polymer  NumSeqs
Minimum:    14291  22553  408     0       5        1
2.5%-tile:  14291  22553  408     0       5        1
25%-tile:   14291  22553  408     0       5        1
Median:     14291  22553  408     0       5        1
75%-tile:   14291  22553  408     0       5        1
97.5%-tile: 14291  22553  408     0       5        1
Maximum:    14291  22553  408     0       5        1
Mean:       14291  22553  408     0       5
# of Seqs:  1

It took 0 secs to summarize 1 sequences.

Output File Names:


Thanks - a few questions…

  1. Are you positive that you have V4 sequences?
  2. What happens when you classify the Acantharia_V4.fasta file against your trimmed version of the silva database?
  3. What’s the closest Acantharia sequence in the silva_seed database?

My concern is that you either don’t really have V4 sequences or that the database has poor representation of the Acantharia. You might try to repeat the pcr.seqs and following steps using the silva.nr_v132, which will have more sequences in it.
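A quick way to gauge representation is to count the taxonomy lines that mention the group in each database. This is a rough sketch, assuming the taxonomy files sit next to the alignments and are named silva.seed_v132.tax and silva.nr_v132.tax. The function name `count_taxon` is made up for illustration.

```shell
# Count the lines in a taxonomy file that mention a taxon name
# (case-insensitive, fixed-string match). count_taxon is a hypothetical helper.
count_taxon() {
  local tax_file="$1" taxon="$2"
  # grep exits non-zero when the count is 0; "|| true" keeps that from
  # being treated as an error while the "0" is still printed.
  grep -ciF -- "$taxon" "$tax_file" || true
}

# Example (file names are assumptions):
#   count_taxon silva.seed_v132.tax Acantharia
#   count_taxon silva.nr_v132.tax Acantharia
```

A count of zero or only a handful in the seed file would be consistent with the poor-representation concern, and would suggest that silva.nr_v132 is the better starting point.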
