Database Curation

Hello everyone,

I have two questions for you. I’m working with environmental samples, specifically pitcher plant fluid samples. There are a lot of unclassified sequences, particularly in my 18S rRNA runs. Looking at the data, I’m wondering just how definite these unclassified sequences are. I’m curious whether one of two things is possible.

a) Can I force sequences to only classify to known database sequences (i.e. excluding unclassified sequences)?

b) Is there a way to curate the database I’m using to remove unclassified sequences to see what those sequences would otherwise match to?

Any help you could give would be appreciated!

Thanks,

  • Jake

If you’re getting sequences that can’t be classified at all then they can be removed with the remove.lineage() command.

There aren’t (well, shouldn’t be) any unclassified sequences in the classification databases. Unclassified sequences are more likely a sign that your sequences can’t be accurately classified. If you’re using the default Bayesian method of classify.seqs(), I’m not aware of anyone actually validating this method on 18S data sets. The original paper reported high classification rates (>98% I think) but only with 16S data (for which there is a much larger data set to train on). You might find that switching to a BLAST-based approach helps reduce your unclassified rate.

Thanks for your reply.

I should have been more specific. I’m trying to get family-level classifications for sequences that were originally classified only to the order level, for example.

Something you could try doing to improve accuracy is to align your classification database and then trim the alignment to only include positions included in your amplicons. The approach was outlined here. It’s quite easy to perform in mothur - when you use filter.seqs on your alignment you get a *.filter file (say pitcher_plant.filter) which is the lane mask for the alignment. All you need to do is to align your classification database to the same reference and then filter the database using

filter.seqs(fasta=database.align, hard=pitcher_plant.filter)
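If it helps to see what the hard mask is doing, here is a minimal Python sketch of the idea. It assumes the *.filter file holds a single line of 0 and 1 characters, one per alignment column, where 1 means "keep this column". The function name apply_lane_mask and the toy sequence are made up for illustration; this is not mothur's own code.

```python
def apply_lane_mask(aligned_seq: str, mask: str) -> str:
    """Keep only the alignment columns where the mask has a '1'.

    Assumption: the mask is a string of '0'/'1' characters with one
    character per alignment column, as in a *.filter file.
    """
    if len(aligned_seq) != len(mask):
        raise ValueError("sequence and mask must have the same length")
    return "".join(base for base, keep in zip(aligned_seq, mask) if keep == "1")


# Toy example: a 10-column aligned reference and a mask keeping 5 columns.
mask = "0011101100"
reference = "AC-GT.ACGT"
trimmed = apply_lane_mask(reference, mask)
print(trimmed)
```

Because every database sequence is cut with the same mask as your amplicons, the classifier only sees the region you actually sequenced.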

You can then just repeat the classification as normal. As long as you didn’t lose any database sequences during alignment, that’s all the work you’ll need to do. If you find that some reference sequences were discarded, you’ll need to remove those entries from the taxonomy file so that it still matches the filtered database.
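For the taxonomy clean-up, a rough Python sketch is below. It assumes a FASTA file whose header lines start with ">" followed by the sequence ID, and a taxonomy file with one "ID, whitespace, taxonomy string" entry per line. The function names and the toy records are hypothetical. mothur's own list.seqs and get.seqs commands may well cover the same job, so treat this as an illustration of the logic rather than the recommended route.

```python
def fasta_ids(fasta_text: str) -> set:
    """Collect sequence IDs (first token after '>') from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def prune_taxonomy(taxonomy_text: str, keep_ids: set) -> str:
    """Keep only taxonomy lines whose ID is still present in the FASTA."""
    kept = []
    for line in taxonomy_text.splitlines():
        fields = line.split(None, 1)  # ID, then the taxonomy string
        if fields and fields[0] in keep_ids:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Toy data: refB was lost during alignment, so it is absent from the FASTA.
filtered_fasta = ">refA\nACGT\n>refC some description\nGGCC\n"
taxonomy = (
    "refA\tEukaryota;Fungi;\n"
    "refB\tEukaryota;Metazoa;\n"
    "refC\tEukaryota;Alveolata;\n"
)
print(prune_taxonomy(taxonomy, fasta_ids(filtered_fasta)), end="")
```

In practice you would read the two files from disk and write the pruned text to a new taxonomy file, leaving the original untouched.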

Having said this - I’ve had pretty dodgy results using this filtering approach on 18S data. My data contained a mixture of eukaryotes (the metazoan host, plant material, fungi and protozoa), so I think there weren’t enough conserved positions between the different groups to make a good classification. If you suspect this is a problem, you could try using classify.seqs() with the parameters [method=knn, search=blast], which will hopefully give a more reliable result, although it will be slower to run.
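To judge whether one method really gets you further down the tree than another (say, family instead of order), you could tally how deep each sequence was classified in the two output taxonomy files. The Python sketch below assumes each line is "ID, tab, taxonomy string", with ranks separated by semicolons, an optional confidence in parentheses after each name, and unresolved ranks written as "unclassified" or "<name>_unclassified". The exact labels vary between mothur versions, so check them against your own output. The function names and toy lines are hypothetical.

```python
import re
from collections import Counter


def classified_depth(taxonomy: str) -> int:
    """Count ranks before the first 'unclassified' label."""
    depth = 0
    for label in taxonomy.strip().strip(";").split(";"):
        # Drop a trailing confidence score such as "(98)".
        name = re.sub(r"\(\d+\)$", "", label.strip())
        if not name or name == "unclassified" or name.endswith("_unclassified"):
            break
        depth += 1
    return depth


def depth_histogram(taxonomy_text: str) -> Counter:
    """Map classification depth -> number of sequences at that depth."""
    counts = Counter()
    for line in taxonomy_text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            counts[classified_depth(parts[1])] += 1
    return counts


# Toy output: two sequences resolved to 2 ranks, one resolved to 3 ranks.
example = (
    "seq1\tEukaryota(100);Fungi(98);unclassified(98);\n"
    "seq2\tEukaryota(100);Metazoa(100);Arthropoda(95);\n"
    "seq3\tEukaryota(100);Fungi(90);Fungi_unclassified(90);\n"
)
for depth, n in sorted(depth_histogram(example).items()):
    print(depth, n)
```

Running this on the output from each method and comparing the histograms gives a quick feel for whether the slower search is buying you any extra resolution.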

Thanks, I’ll take a look at the paper and give the command a try.