Issues with customizing the reference database in Miseq SOP

Hi. i’m doing analysis with my data with Miseq SOP.

My question is if i can use screen.seqs commands after trimming reference database.

First i wrote down what i’ve done below.

To customize the reference database,

  1. I downloaded E.coli 16S rRNA gene sequence(GenBank: J01859.1) from NCBI.

  2. I trimmed the downloaded E.coli 16S rRNA gene sequence with my oligo files.
    mothur > pcr.seqs(fasta=e.coli_16S.fasta, oligos=primer.oligos).

  3. And aligned the trimmed E.coli sequence against the Silva.seed.v132 reference database
    mothur > align.seqs(fasta=e.coli_16S.pcr.fasta, reference=silva.seed_v132.align).

  4. Summarized the result
    mothur > summary.seqs(fasta=e.coli_16S.pcr.align).

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs

Minimum: 6428 23440 427 1 6 1
2.5%-tile: 6428 23440 427 1 6 1
25%-tile: 6428 23440 427 1 6 1
Median: 6428 23440 427 1 6 1
75%-tile: 6428 23440 427 1 6 1
97.5%-tile: 6428 23440 427 1 6 1
Maximum: 6428 23440 427 1 6 1
Mean: 6428 23440 427 1 6
# of Seqs: 1

It took 0 secs to summarize 1 sequences
Output File Names:
/Users/uihyeontae/Downloads/20200618_re/e.coli_16S.pcr.summary
  1. Customized the reference database with my start/end point
    mothur > pcr.seqs(fasta=silva.nr_v132.align, start=6428, end=23440, keepdots=F).

  2. Summarized the result
    mothur > summary.seqs(fasta=silva.nr_v132.pcr.align).

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 15342 308 0 3 1
2.5%-tile: 2 17012 397 0 4 5328
25%-tile: 2 17012 402 0 4 53280
Median: 2 17012 422 0 5 106560
75%-tile: 2 17012 426 0 6 159840
97.5%-tile: 2 17012 567 1 7 207792
Maximum: 3825 17012 1628 5 16 213119
Mean: 2 17011 429 0 5

# of Seqs: 213119

It took 50 secs to summarize 213119 sequences.
Output File Names:
/Users/uihyeontae/Downloads/20200618_re/silva.nr_v132.pcr.summary

As my oligos file contains 341F/850R primer sequences to cover V3-V4 regions(unfortunately…) about 464 bp, i felt odd that the result showed it has maximum 1628 bp (Maximum nbases above).
So i wondered if i should use screen.seqs commands to remove that part.
For example,
mothur > screen.seqs(fasta=silva.nr_v132.pcr.align, maxlength=500).

Thanks
Uihyeon Tae.

Good work! The 1628 certainly looks problematic.Going to 500 might be a bit short. Could you look at what the sequences are that are 500+ nt long? Maybe some are good?

Pat

Thanks for your feedback!
When i looked at the sequences which are over 500nt, they were most Eukaryote.

The examples which are some of sequences over 500 nt are below.

APFE030506153.A6LT1338 95.85 Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Tracheophyta;Spermatophyta;Pinophyta;

MTTB01000006.ZeaM1627 98.39 Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Tracheophyta;Spermatophyta;Magnoliophyta;Liliopsida;Poales;Zea;

GEHE01002856.K0FSpec2 98.11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricales;Agaricaceae;

HM581707.UncM8467 98.88 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniphycidae;Peridiniales;Heterocapsa;

JN936466.Ay3Biocu 91.89 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Annelida;Polychaeta;Scolecida;Terebellida;

KT722407.FF9Phil2 96.13 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Cnidaria;Hydrozoa;Hydroidolina;Leptothecata;

KC671663.UncFu419 97.51 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Leotiomycetes;Helotiales;Incertae Sedis;Tetracladium;

KJ763294.UncE1176 89.55 Eukaryota;SAR;Alveolata;Protalveolata;Syndiniales;Syndiniales Group II;

The sequence which has 1628 nt is also Eukaryote.

KR149268.PrpNiti2 82.18 Eukaryota;Archaeplastida;Rhodophyceae;Bangiales;Pyropia;

As my sample is about bacterial 16S rRNA gene, i think that they can be removed. But it would be quite short to get shorter than 500 as your feedback, i reset to 567 which are Nbases of 97.5%-tile.

mothur > summary.seqs(fasta=silva.nr_v132.pcr.align)

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	15342	308	0	3	1
2.5%-tile:	2	17012	397	0	4	5328
25%-tile:	2	17012	402	0	4	53280
Median: 	2	17012	422	0	5	106560
75%-tile:	2	17012	426	0	6	159840
97.5%-tile:	2	17012	567	1	7	207792
Maximum:	3825	17012	1628	5	16	213119
Mean:	2	17011	429	0	5
# of Seqs:	213119

It took 43 secs to summarize 213119 sequences.

Output File Names:
/Users/uihyeontae/Downloads/20200618_re/silva.nr_v132.pcr.summary

mothur > screen.seqs(fasta=silva.nr_v132.pcr.align, maxlength=567)

mothur > rename.file(input=silva.nr_v132.pcr.good.align, new=silvaV3V4.fasta).

mothur > summary.seqs(fasta=silvaV3V4.fasta).

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs

Minimum: 1 15342 308 0 3 1

2.5%-tile: 2 17012 396 0 4 5200

25%-tile: 2 17012 402 0 4 51997

Median: 2 17012 421 0 5 103993

75%-tile: 2 17012 426 0 5 155989

97.5%-tile: 2 17012 565 1 7 202786

Maximum: 3825 17012 567 5 16 207985

Mean: 2 17011 425 0 5

# of Seqs: 207985

It took 42 secs to summarize 207985 sequences.

Output File Names:

/Users/uihyeontae/Downloads/20200618_re/silvaV3V4.summary

But, i still have a question.
I found the step that can remove undesirable sequences with remove.lineage command.
So is it more good not to use screen.seqs (mothur > screen.seqs(fasta=silva.nr_v132.pcr.align, maxlength=567) as there is a final quality control step?

Thanks
Uihyeon Tae.

Hi there,

The remove.lineage command is more for cases where there’s a specific taxon that you want to remove rather than based on the characteristic of the sequence. I think screen.seqs is what you want to use.

Pat

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.