Database Curation

Grothjan · January 28, 2015, 6:12pm

Hello everyone,

I have two questions for you. I’m working with environmental samples, specifically pitcher plant fluid samples. I get a lot of unclassified sequences, particularly when I ran 18S rRNA. Looking at the data, I’m wondering just how definite these unclassified assignments are. I’m curious whether either of two things is possible.

a) Can I force sequences to only classify to known database sequences (i.e. excluding unclassified sequences)?

b) Is there a way to curate the database I’m using to remove unclassified sequences to see what those sequences would otherwise match to?

Any help you could give would be appreciated!

Thanks,

Jake

dwaite · January 28, 2015, 8:46pm

If you’re getting sequences that can’t be classified at all then they can be removed with the remove.lineage() command.

There aren’t (well, shouldn’t be) any unclassified sequences in the classification databases. Unclassified sequences are more likely a result of your sequences not being classifiable with enough confidence. If you’re using the default Bayesian method of classify.seqs(), I’m not aware of anyone actually validating this method on 18S data sets. The original paper reported high classification rates (>98% I think) but only with 16S data (for which there is a much larger data set to train on). You might find that switching to a BLAST-based approach helps reduce your unclassified rate.
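If you want to compare the unclassified rate between two classification runs, a rough tally of how deep each sequence was classified can help. Below is a minimal Python sketch, not part of mothur. It assumes the output taxonomy file has one sequence per line (name, whitespace, then a semicolon-separated lineage with optional confidence scores in parentheses) and that unassigned ranks are labelled "unknown", "unclassified" or something ending in "_unclassified". The function names are made up for illustration, so check the labels against your own file first.

```python
import re
from collections import Counter

# Strips a trailing confidence score such as "(95)" from a taxon label.
_CONFIDENCE = re.compile(r"\(\d+(?:\.\d+)?\)$")


def classified_depth(lineage):
    """Return the number of ranks assigned before the first unclassified one.

    Assumption: unassigned ranks are labelled "unknown", "unclassified",
    or end in "_unclassified".
    """
    depth = 0
    for taxon in lineage.strip().strip(";").split(";"):
        name = _CONFIDENCE.sub("", taxon.strip()).lower()
        if not name or name == "unknown" or name.endswith("unclassified"):
            break
        depth += 1
    return depth


def depth_histogram(taxonomy_lines):
    """Count sequences by how many ranks they were classified to."""
    histogram = Counter()
    for line in taxonomy_lines:
        fields = line.split(None, 1)  # sequence name, then lineage
        if len(fields) == 2:
            histogram[classified_depth(fields[1])] += 1
    return histogram
```

Running depth_histogram() on the taxonomy output of each method (for example over the lines of an open file) gives two tallies you can compare directly. Depth 0 means the sequence was not classified at all.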

Grothjan · January 29, 2015, 5:19pm

Thanks for your reply.

I should have been more specific. I’m trying to get sequences classified to the family level that were originally only classified to the order level, for example.

dwaite · January 29, 2015, 7:16pm

Something you could try doing to improve accuracy is to align your classification database and then trim the alignment to only include positions included in your amplicons. The approach was outlined here. It’s quite easy to perform in mothur - when you use filter.seqs on your alignment you get a *.filter file (say pitcher_plant.filter) which is the lane mask for the alignment. All you need to do is to align your classification database to the same reference and then filter the database using

filter.seqs(fasta=database.align, hard=pitcher_plant.filter)

You can then just repeat the classification as per normal. As long as you didn’t lose any database sequences during alignment that’s all the work you’ll need. If you find some reference sequences get discarded you’ll need to remove those entries from the taxonomy file.
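If reference sequences do get dropped, bringing the taxonomy file back in line with the FASTA file can be scripted. Here is a minimal Python sketch, not a mothur command. It assumes the sequence name is the first whitespace-delimited token on both the FASTA header line and the taxonomy line. The function names and file paths are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Collect sequence names from FASTA headers (text after '>' up to whitespace)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def sync_taxonomy(taxonomy_lines, keep_ids):
    """Keep only the taxonomy rows whose sequence name is in keep_ids."""
    kept = []
    for line in taxonomy_lines:
        fields = line.split(None, 1)  # sequence name, then lineage
        if fields and fields[0] in keep_ids:
            kept.append(line.rstrip("\n"))
    return kept


def sync_files(fasta_path, taxonomy_path, out_path):
    """Write a taxonomy file containing only sequences present in the FASTA file."""
    with open(fasta_path) as fasta:
        keep_ids = fasta_ids(fasta)
    with open(taxonomy_path) as taxonomy:
        kept = sync_taxonomy(taxonomy, keep_ids)
    with open(out_path, "w") as out:
        for row in kept:
            out.write(row + "\n")
    return len(kept)
```

Calling sync_files() with your filtered database FASTA, the original taxonomy file and a new output name returns the number of entries kept, which should match the number of sequences left in the FASTA file.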

Having said this - I’ve had pretty dodgy results using this filtering approach on 18S data. My data had a mixture of eukaryotes (the metazoan host, plant material, fungi and protozoa), so I think there weren’t enough conserved positions between the different groups to make a good classification. If you suspect this is a problem, you could try using classify.seqs() with the parameters [method=knn, search=blast], which will hopefully give a more reliable result, although it will be slower to run.

Grothjan · January 30, 2015, 4:45pm

Thanks, I’ll take a look at the paper and give the command a try.
