classify.seqs had "xxx could not be classifed" warning

hi,

I follow mothur miseq SOP and trim the coding according to my data set, it’s always a smooth ride. Yet this time I got the warning when I was doing classify.seqs, saying “xxx could not be classifed, you can remove taxon=unknown etc…” the rest process still went through. Although I was wondering why this 391 sequences could not be classified?

I’ve been using v4 region to be amplified, sequenced and classified OTUs, but for this data set, it’s v4-v5 region. I had to trim the silva pcr bacteria data set for v4-v5. Here was my protocol, in case (but not likely) that caused the problem:

  1. got ecoli 16s rRNA sequence and named it ecoli.16srrna.fasta
  2. made oligo file with v4-v5 primer
    forward:GTGCCAGCMGCCGCGG
    reverse:CCGTCAATTCMTTTRAGTTT
  3. pcr.seqs(fasta=ecoli.16srrna.fasta, oligos=pcrTest.oligos)
    align.seqs(fasta=ecoli.16srrna.pcr.fasta, reference=silva.bacteria.fasta)
    summary.seqs(fasta=ecoli.16srrna.pcr.align)

Start End NBases Ambigs Polymer NumSeqs Minimum: 13858 27654 376 0 6 1 2.5%-tile: 13858 27654 376 0 6 1 25%-tile: 13858 27654 376 0 6 1 Median: 13858 27654 376 0 6 1 75%-tile: 13858 27654 376 0 6 1 97.5%-tile: 13858 27654 376 0 6 1 Maximum: 13858 27654 376 0 6 1 Mean: 13858 27654 376 0 6 # of Seqs: 1
  1. pcr.seqs(fasta=silva.bacteria.fasta, start=13858, end=27654, keepdots=F, processors=8)
    change the name to silva.v4_v5.fasta

summary.seqs(fasta=silva.v4_v5.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13298 353 0 3 1
2.5%-tile: 1 13796 370 0 4 374
25%-tile: 1 13796 374 0 4 3740
Median: 1 13796 375 0 5 7479
75%-tile: 1 13796 376 0 5 11218
97.5%-tile: 1 13796 378 1 6 14583
Maximum: 3 13796 434 5 9 14956
Mean: 1.00053 13796 374.886 0.0875234 4.74318

of Seqs: 14956

Then I continued to use mothur SOP
align.seqs(fasta=current, reference=silva.v4_v5.fasta)
summary.seqs(fasta=DH.trim.contigs.good.unique.align, count=DH.trim.contigs.good.count_table)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1 13796 372 0 4 63375
25%-tile: 1 13796 374 0 4 633742
Median: 1 13796 376 0 5 1267484
75%-tile: 1 13796 376 0 5 1901225
97.5%-tile: 1 13796 378 0 8 2471592
Maximum: 13796 13796 379 0 39 2534966
Mean: 2.60451 13794.3 375.102 0 4.94335

of unique seqs: 787445

total # of seqs: 2534966

screen.seqs(fasta=DH.trim.contigs.good.unique.align, count=DH.trim.contigs.good.count_table, summary=DH.trim.contigs.good.unique.summary, start=1, end=13796, maxhomop=8, processors=8)

summary.seqs(fasta=DH.trim.contigs.good.unique.good.align, count=DH.trim.contigs.good.good.count_table, processors=12)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13796 347 0 3 1
2.5%-tile: 1 13796 372 0 4 63102
25%-tile: 1 13796 374 0 4 631014
Median: 1 13796 376 0 5 1262028
75%-tile: 1 13796 376 0 5 1893041
97.5%-tile: 1 13796 378 0 8 2460953
Maximum: 1 13796 379 0 8 2524054
Mean: 1 13796 375.187 0 4.93709

of unique seqs: 781854

total # of seqs: 2524054

filter.seqs(fasta=DH.trim.contigs.good.unique.good.align, vertical=T, trump=.)
Length of filtered alignment: 898
Number of columns removed: 12898
Length of the original alignment: 13796
Number of sequences used to construct filter: 781854

unique.seqs(fasta=DH.trim.contigs.good.unique.good.filter.fasta, count=DH.trim.contigs.good.good.count_table)
pre.cluster(fasta=DH.trim.contigs.good.unique.good.filter.unique.fasta, count=DH.trim.contigs.good.unique.good.filter.count_table, diffs=2)

chimera.uchime(fasta=DH.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=DH.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
summary.seqs(fasta=current, count=current)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 898 349 0 3 1
2.5%-tile: 1 898 372 0 4 56676
25%-tile: 1 898 374 0 4 566751
Median: 1 898 376 0 5 1133502
75%-tile: 1 898 376 0 5 1700253
97.5%-tile: 1 898 378 0 8 2210328
Maximum: 1 898 379 0 8 2267003
Mean: 1 898 375.179 0 4.92978

of unique seqs: 303002

total # of seqs: 2267003

classify.seqs(fasta=DH.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=DH.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.nr_v123.align, taxonomy=silva.nr_v123.tax, cutoff=80)

Here is where the [WARNING] pop up:

[WARNING]: HISEQ_275_HNVH3BCXX_2_1114_2050_19147 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.

There are 391 sequences could not be classified.

So what could be the problem, please?

Thanks much for the help. Sorry for the length of the post, I tried to provide enough details for diagnose :smiley:

From my understanding it’s not abnormal that some of the sequences cannot be classified. Some erroneous sequences will always make it past your quality filtering. If only 391 sequences out of >2 million sequences cannot be classified I would not be concerned.

If you are concerned though, have you tried looking at what the sequences are that could not be identified? Do they look like V4-5 16S sequences?

Richard

Hi Richard,

Yeah I agree, 391/millions reads cannot be classified is not so bad :smiley: I was more like wondering why. I took a look of those sequences, not much different to the rest. I will take a closer look through. Thanks