I’ve classified my sequences using get.oturep (0.05 cutoff) and classify.seqs (Bayesian method) with a bootstrap cutoff of 80% or 60%. When I look at genera within phyla, I can get 40-85% unclassified genera within a phylum (e.g., Firmicutes: 86% of my genera are unclassified). I am currently using the non-redundant silva database (v.102). Is there a way to decrease the number of unclassified genera within a phylum?
I suspect your problem is your read length. If you have short sequences, you will be less likely to classify them as deeply as you would with longer reads. Another potential issue, though less significant, is sequence quality. Of course, another issue is what type of environment you are sampling - if you’d expect it to have a lot of novel taxa, then the classifier is likely to fail. Some suggestions…
Try some of the different reference taxonomies we provide (or that you can get from an ARB database). These often vary in their ability to classify various parts of the tree.
Pursue an OTU-based approach and then classify your OTUs. This is the downfall of phylotyping - you can only classify what’s been seen before.
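To illustrate the first suggestion, you could run classify.seqs against two of the *.tax files distributed with the silva reference and compare the results. This is only a sketch - the filenames here (final.fasta, silva.bacteria.fasta, and the two .tax files) are placeholders for whichever reference alignment and taxonomy files you have downloaded:

```
mothur > classify.seqs(fasta=final.fasta, template=silva.bacteria.fasta, taxonomy=silva.bacteria.silva.tax, cutoff=80)
mothur > classify.seqs(fasta=final.fasta, template=silva.bacteria.fasta, taxonomy=silva.bacteria.rdp.tax, cutoff=80)
```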
The following is a summary of my precluster.fasta file. I’m not sure what read length would be considered short. These samples were taken from cattle rumen and feces, so it is very likely that there are many unclassified sequences, but 86% of Firmicutes being unclassified seemed very high.
I did follow an OTU-based approach (get.oturep) and classified the OTU reps using the non-redundant silva database (with RDP taxonomy). Are you suggesting I try a different database or a different reference taxonomy file?
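For reference, percentages like the 86% above can be tallied directly from the classify.seqs output. The following is a rough sketch, not a mothur feature - it assumes the usual .taxonomy file format (one sequence per line, tab-separated name and a semicolon-delimited lineage with bootstrap values in parentheses, with unresolved levels reported as "unclassified", and genus as the sixth level):

```python
import re
from collections import Counter

def tally_unclassified_genera(lines):
    """Count (unclassified-genus, total) sequences per phylum from
    mothur-style taxonomy lines: 'name<TAB>Kingdom(100);Phylum(98);...;'"""
    totals = Counter()        # sequences per phylum
    unclassified = Counter()  # sequences with an unresolved genus, per phylum
    for line in lines:
        name, tax = line.rstrip().split("\t")
        # strip bootstrap values like '(98)' and the trailing semicolon
        levels = [re.sub(r"\(\d+\)", "", t) for t in tax.strip(";").split(";")]
        phylum = levels[1] if len(levels) > 1 else "unclassified"
        genus = levels[5] if len(levels) > 5 else "unclassified"
        totals[phylum] += 1
        if genus == "unclassified":
            unclassified[phylum] += 1
    return {p: (unclassified[p], n) for p, n in totals.items()}
```

Dividing the two counts for a phylum then gives its fraction of unclassified genera.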
mothur > summary.seqs(fasta=dennisLabeledFinal.pick.trim.unique.good.filter.unique.precluster.fasta)
            Start  End   NBases  Ambigs  Polymer
Minimum:    1      919   252     0       3
2.5%-tile:  1      919   271     0       4
25%-tile:   1      919   285     0       5
Median:     1      919   291     0       5
75%-tile:   1      919   298     0       5
97.5%-tile: 1      919   313     0       6
Maximum:    3      919   359     0       8
# of Seqs:  34496
Thanks for the help!
Well, maybe - 250 to 350 bases may be too short to get good classification within the Firmicutes. Our analysis shows that with shorter sequences, Firmicutes do not classify as deeply as other groups. You might try some of the other *.tax files that are available with the silva reference files. Also, keep in mind that a 250 bp read will not classify the same as a 350 bp read, so it is more appropriate to get all your sequences to be about the same length with the filter.seqs command so you’re comparing like to like.
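A minimal sketch of that step, using the filename and coordinates from the summary above (the start=3, end=919 values are illustrative - screen.seqs drops reads that don’t span that region of the alignment, and filter.seqs with vertical=T then removes the gap-only columns):

```
mothur > screen.seqs(fasta=dennisLabeledFinal.pick.trim.unique.good.filter.unique.precluster.fasta, start=3, end=919)
mothur > filter.seqs(fasta=dennisLabeledFinal.pick.trim.unique.good.filter.unique.precluster.good.fasta, vertical=T)
```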
So if I understand correctly, I need to remove all the gaps in my fasta file and make the NBases length equal for all sequences? If so, will the sequences have different start and end positions? Which options would I need to include in filter.seqs? Would the following be appropriate?
Also, at which point in my analysis should I run this command? That is, which fasta file should I use - the fasta file created after pre.cluster?
This is the order of commands I have been using:
Sorry - they should overlap over the same alignment coordinates and won’t necessarily be the same length, but they should be close. No need to worry about the gaps when running classify.seqs.
Here’s the order I suggest…
I’ve updated the Costello Analysis to reflect how we’re doing this in our lab.