Apologies if this is a stupid q, but I am struggling to understand the classification cutoffs etc within Mothur.
When I reach the classify.seqs step - the protocol suggests setting the cutoff=80.
It was always my understanding that the minimum cutoff for classifying bacteria to genus was 97% similarity, so why is 80 used here?
Further down the protocol, the cutoff value changes when classifying OTUs (e.g. 0.15, 0.03) or when using the phylotype method (label=1 - does this equate to 97% confidence?).
Can somebody please explain which is the correct cutoff to use? Just a little confused.
It looks like you’re confusing a few different things here. First, classification: there are two common approaches to classifying 16S data. One is a BLAST-like approach, which performs alignments between query sequences and a target database. For this sort of analysis you get a sequence similarity score (percent identity or some such) for evaluating your matches.
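To make the "similarity score" idea concrete, here's a toy sketch (not BLAST or mothur code, just an illustration) of the percent identity that an alignment-based classifier would report for two pre-aligned fragments:

```python
# Toy illustration only: percent identity between two aligned sequences,
# the kind of score a BLAST-like classifier reports for each hit.

def percent_identity(query: str, target: str) -> float:
    """Percentage of identical positions between two equal-length aligned sequences."""
    if len(query) != len(target):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(q == t for q, t in zip(query, target))
    return 100.0 * matches / len(query)

print(percent_identity("ACGTACGT", "ACGTACGA"))  # 7/8 positions match -> 87.5
```

Real aligners handle gaps and local alignment, of course, but the score you evaluate your matches against is essentially this.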
The other way, which is the default in mothur, is Bayesian analysis of the k-mer profiles of your sequences (originally described here). This analysis returns the most likely match, but tells you nothing about how good the match is. To get support for a hit, these classifiers bootstrap the results so that you know you’re getting a consistent match. This is what cutoff=80 means: it’s not 80% sequence similarity (which would be rubbish), but a requirement that, for a sequence to be considered classified at a particular rank, >80% of the bootstrap replicates have to return the same taxonomy.
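Here's a toy sketch of that bootstrap-consensus idea (illustrative only, not mothur's actual Wang classifier): each bootstrap replicate produces a taxonomy, and each rank is reported only while the agreement among replicates stays at or above the cutoff:

```python
# Toy sketch of a bootstrap confidence cutoff (e.g. cutoff=80).
# Not mothur's implementation - just the logic the cutoff expresses.

from collections import Counter

def consensus_taxonomy(bootstrap_hits: list[list[str]], cutoff: int = 80) -> list[str]:
    """Keep each taxonomic rank only while replicate agreement >= cutoff percent."""
    n = len(bootstrap_hits)
    depth = min(len(hit) for hit in bootstrap_hits)
    consensus = []
    for level in range(depth):
        counts = Counter(hit[level] for hit in bootstrap_hits)
        name, votes = counts.most_common(1)[0]
        support = 100 * votes / n
        if support < cutoff:
            break  # below this rank the classification is unreliable
        consensus.append(f"{name}({support:.0f})")
    return consensus

# 9 of 10 replicates agree at genus level -> 90% support, kept at cutoff=80
hits = [["Bacteria", "Firmicutes", "Lactobacillus"]] * 9 + \
       [["Bacteria", "Firmicutes", "Streptococcus"]]
print(";".join(consensus_taxonomy(hits)))
# Bacteria(100);Firmicutes(100);Lactobacillus(90)
```

Raise the cutoff to 95 in that example and the genus gets dropped, leaving the sequence classified only to phylum - which is exactly the trade-off the cutoff parameter controls.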
Where you talk about cutoff 0.15, that sounds like you’re talking about OTU clustering (which is often confused with classification, although they’re completely different concepts). OTU clustering is an approach that aggregates sequences that share a certain amount of similarity with each other. It’s standard in the OTU clustering process (dist.seqs -> cluster -> make.shared) to create OTUs from sequences that share >97% sequence similarity, which is where the 0.03 comes in (0.97 similarity = 0.03 distance).
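A toy sketch of that similarity/distance relationship (a greedy stand-in, not mothur's actual cluster() algorithm): pairwise distance is just the fraction of mismatching positions, and sequences within 0.03 of an OTU's representative fall into that OTU:

```python
# Toy greedy OTU clustering - illustrative only, not mothur's cluster() method.

def distance(a: str, b: str) -> float:
    """Uncorrected pairwise distance between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_otus(seqs: list[str], cutoff: float = 0.03) -> list[list[str]]:
    """Assign each sequence to the first OTU whose representative is within
    `cutoff` distance; otherwise it seeds a new OTU."""
    otus: list[list[str]] = []
    for seq in seqs:
        for otu in otus:
            if distance(seq, otu[0]) <= cutoff:
                otu.append(seq)
                break
        else:
            otus.append([seq])
    return otus

seqs = ["A" * 100,
        "A" * 98 + "CC",       # distance 0.02 -> joins the first OTU at 0.03
        "A" * 90 + "C" * 10]   # distance 0.10 -> seeds its own OTU
print(len(cluster_otus(seqs)))  # 2
```

At cutoff=0.15 all three sequences would collapse into a single OTU, which is why the cutoff you pick directly determines how many OTUs you end up with.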
Phylotyping is an alternative method of clustering data: rather than using sequence similarity, you group sequences that were assigned the same taxonomy. Phylotype (label=1) groups your sequences at the most specific classification rank your sequences were classified to (genus in the RDP database, species in Greengenes).
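A toy version of that binning step (illustrative only, not mothur's phylotype command - and note this sketch counts ranks from the root, whereas mothur's label counts from the most specific rank):

```python
# Toy phylotype grouping: bin sequence IDs purely by their assigned taxonomy,
# truncated at a chosen rank depth. Not mothur code - just the concept.

from collections import defaultdict

def phylotype(assignments: dict[str, str], depth: int) -> dict[str, list[str]]:
    """Group sequence IDs by the first `depth` ranks of their taxonomy string."""
    groups: dict[str, list[str]] = defaultdict(list)
    for seq_id, taxonomy in assignments.items():
        key = ";".join(taxonomy.split(";")[:depth])
        groups[key].append(seq_id)
    return dict(groups)

assignments = {
    "seq1": "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus",
    "seq2": "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus",
    "seq3": "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus",
}
print(len(phylotype(assignments, depth=6)))  # 2 genus-level phylotypes
```

So with RDP-trained classifications, grouping at the deepest rank (the label=1 equivalent) gives you genus-level bins - no sequence similarity threshold, and no 97% confidence involved.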
I hope that clears things up.
Thank you so much for your very helpful reply. I really appreciate it and it has clarified the issue perfectly.
One question - many papers in the literature use the BLAST-based classification methods, but is the Bayesian approach more reliable for MiSeq data? Does the bootstrapping give more confidence in the assignments?
Hm, in my experience they’re very comparable. Around 2011/2012 I was involved in a project testing the efficacy of different classification schemes. We tried different combinations of BLAST/Bayesian classification, RDP/SILVA/Greengenes reference databases, and full-length or short-fragment query sequences. The biggest point of difference was the database employed; the accuracy of BLAST vs Bayesian classification (at the default/recommended settings) was within 1-2% when using the same database.
I don’t know that Bayesian classification is more reliable; its popularity is probably more a matter of speed. BLAST is comparatively slow, but even with 100-120k sequences (like you’d typically get from a GS Junior sequencing run) it usually only takes a few hours to push a BLAST classification through on a standard desktop computer. Since MiSeq has upped that to millions of sequences, the time factor becomes more limiting.
Thank you very much again for your helpful reply. I really appreciate you taking the time to explain this.