Classifying sequences

Dear,

I was wondering about the biological relevance (and/or quality) of sequences which have ‘low’ bootstrap values at the domain level (e.g. Bacteria). For example, using a cutoff=80, what about sequences just above this value (e.g. 83, 87, …)?
I’m working on Antarctic samples, so I expect “some” unknown stuff, but wouldn’t these sequences be too suspicious?

Can you somehow relate bootstrap values to similarity? Is there a way to obtain similarity values using the classify.seqs command, or can you only obtain approximate values via the label option (cutoff of 0.01, 0.03, …) by clustering and subsequently using classify.otu?

Lastly, because of the expected high proportion of “unclassified/unknown” organisms: the more recent online RDP classifier resolves some sequences that remain unclassified with the mothur training set. However, subsequently using the files generated by RDP doesn’t appear to work (I cannot remove chloroplast lineages). Can these files be used directly, or do they have to be reformatted?
Or how can I obtain a newer training set to use with mothur?
I know “names” aren’t everything, but since I need to describe this unknown diversity, it would be nice to assign sequences to at least higher taxonomic levels :slight_smile:

Thanks!

Cheers :mrgreen:

Good questions…

I was wondering about the biological relevance (and/or quality) of sequences which have ‘low’ bootstrap values at the domain level (e.g. Bacteria). For example, using a cutoff=80, what about sequences just above this value (e.g. 83, 87, …)?
I’m working on Antarctic samples, so I expect “some” unknown stuff, but wouldn’t these sequences be too suspicious?

Yeah, that does seem suspicious. A couple of things to check. 1) Unless your sequencing is all in the same direction, you might try classifying the reverse complement of the sequence; this will be automatic in the next release. 2) What does your sequence quality look like? If it’s low, or you aren’t doing much to clean up the sequences, this could be an issue. 3) Perhaps the database is limited in that area of the tree, and you could manually supplement it. 4) It’s possible that it’s not really an SSU rRNA gene sequence :slight_smile: We’ve seen cases where nonspecific DNA gets amplified with 16S primers.
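If you want to try the reverse-complement check by hand before that release, something like the following should work in mothur (all file names here are hypothetical, the exact output name of reverse.seqs may differ by version, and older versions use template= instead of reference=):

```
# reverse-complement the suspect sequences, then classify the output fasta
mothur > reverse.seqs(fasta=suspect.fasta)
mothur > classify.seqs(fasta=suspect.rc.fasta, reference=trainset.fasta, taxonomy=trainset.tax, cutoff=80)
```

If the reverse-complemented reads suddenly classify with high bootstrap support, the originals were simply sequenced on the other strand.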

Can you somehow relate bootstrap values to similarity? Is there a way to obtain similarity values using the classify.seqs command, or can you only obtain approximate values via the label option (cutoff of 0.01, 0.03, …) by clustering and subsequently using classify.otu?

I’m not sure I follow fully. The bootstrap values give a sense of confidence in the classification, not a similarity. It’s essentially like saying we’re XX % confident that your sequence belongs to Bacillus, but it can’t say how similar it is. What you outline is our strategy: classify the sequences, assign the sequences to OTUs, and then get a consensus classification for each OTU.
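A concrete sketch of that strategy in mothur (all file names are placeholders; exact output file names and clustering defaults vary by version):

```
# 1) classify the individual sequences
mothur > classify.seqs(fasta=final.fasta, name=final.names, reference=trainset.fasta, taxonomy=trainset.tax, cutoff=80)
# 2) assign the sequences to OTUs
mothur > dist.seqs(fasta=final.fasta, cutoff=0.03)
mothur > cluster(column=final.dist, name=final.names)
# 3) get a consensus classification per OTU at the 0.03 label
mothur > classify.otu(list=final.an.list, name=final.names, taxonomy=final.taxonomy, label=0.03)
```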

Lastly, because of the expected high proportion of “unclassified/unknown” organisms: the more recent online RDP classifier resolves some sequences that remain unclassified with the mothur training set. However, subsequently using the files generated by RDP doesn’t appear to work (I cannot remove chloroplast lineages). Can these files be used directly, or do they have to be reformatted?
Or how can I obtain a newer training set to use with mothur?

Look for the updated classification to come out in the next few days. It was just updated last week. Also, you can try the mothur-compatible greengenes reference set, which has ~84,000 sequences and is posted here:

http://www.mothur.org/wiki/Greengenes-formatted_databases
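Whichever reference you end up with, removing chloroplasts should then work through remove.lineage, provided the taxonomy file is in mothur’s two-column format (sequence name, tab, semicolon-separated lineage). A sketch with hypothetical file names:

```
# drop sequences whose lineage contains Chloroplast
mothur > remove.lineage(fasta=final.fasta, taxonomy=final.taxonomy, taxon=Chloroplast)
```

If the RDP web output isn’t in that two-column format, that would explain why the removal fails; it would need to be reformatted first.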

Thanks,
Pat

Thanks for the answers, Pat!

I’m generally following along the SOP, so I guess everything remaining is of reasonable quality :mrgreen:
Something else I noticed while checking for chimeras: running UCHIME against SILVA flags quite different sequences than the ‘self’ alternative does (and in both cases actually rather few hits). Perhaps I should remove all of them.

bootstrapping:
What I meant is that when your confidence is high (e.g. 98 %), the sequence will probably be more similar to the reference than when it was assigned to a taxon only 80 times out of 100. I know you can’t tell the exact similarity (though it would be nice to know :wink: ), or whether it would actually make a big difference. RDP suggests using a bootstrap of 50 % for sequences shorter than 250 nt, which retains 95 % correct classifications.
So the only way to know the (minimal) similarity is to cluster them.

Anyway, removing sequences with ~80 % confidence at the domain level probably won’t remove unknown bacteria-like taxa :sunglasses:

Kind regards

RDP suggests using a bootstrap of 50 % for sequences shorter than 250 nt, which retains 95 % correct classifications.

This actually makes zero sense to me and I’m not sure why they recommend it. When I’ve done the same experiment, lower bootstrap support has meant a lower rate of correct classification. If you have shorter sequences, you’re just not able to classify as deeply.

You can always run classify.seqs with method=distance and get the distance to the closest sequence in the database. This isn’t really possible with the Bayesian classifier since it’s comparing it to a model of what taxon XYZ looks like.
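If I read the current command options right, that corresponds to the knn classifier with a distance-based search, something like the following (file names hypothetical):

```
# report the nearest database neighbour for each query sequence
mothur > classify.seqs(fasta=final.fasta, reference=trainset.fasta, taxonomy=trainset.tax, method=knn, search=distance, numwanted=1)
```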

Pat

Dear Pat,

We’ve been using mothur’s classify.seqs command to classify in silico generated short-read sequences, to check how accurate the classifications of short reads are relative to their parent full-length sequences. We’ve used the bootstrap cutoff value that you recommend on the wiki page, i.e. 80 %.
Our work has been reviewed, and one of the reviewers is asking for a rationale for why we used the 80 % cutoff and not, let’s say, 75 or 90 %. It’s clear why we wouldn’t lower the cutoff (you obtain less certainty about the classification), but why not raise it to 90 or 95 %?

Thank you in advance!
Best regards,
Jonas.

Hey Jonas,

80 % is largely historical, from the phylogenetics literature, where it is common to use an 80 % bootstrap confidence threshold. It was also used in the Wang et al. paper. I would tell the reviewer that you’re following the convention and what was originally published. As for altering the % threshold based on sequence length, I think that is unwise, as dropping the threshold will make you less confident in the assignment.

Pat