Custom Taxonomy File

Not sure this is the best place to put this comment, but I did not know where else to post.

In short, I am trying to add some unclassified organisms that we know have a specific function (e.g., PAOs, Anammox) to either the SILVA or RDP taxonomy files. I know some of the sequences in my files correspond to these groups, because I ran SeqMatch in RDP on a selection of them. I have built a database of these sequences and added them to the taxonomy file as shown below.

tax file
AY254882_S000404726 Bacteria;“Anammox”;AY254882;AY254882;AY254882;AY254882;
AY254883_S000404727 Bacteria;“Anammox”;AY254883;AY254883;AY254883;AY254883;
AY257181_S000404804 Bacteria;“Anammox”;AY257181;AY257181;AY257181;AY257181;
.
. and so on

When I run classify.seqs using only my “functional” taxonomy file, I find sequences classified to my “functional” library. When I add these sequences to the RDP library, I no longer see any sequences classified to my functional groups. I know that I could pull out all the unclassified sequences or unclassified Bacteria and reclassify them, but I am working (and will be working) with many data sets and would much prefer to add some functional groups to the database.
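For anyone trying the same thing, here is a minimal Python sketch of appending custom entries to a reference taxonomy. It assumes each line is an ID, whitespace, then a semicolon-delimited lineage, as in the lines above. The function names and the REF0001 entry are made up for illustration and are not part of mothur or the RDP files. The matching sequences would also need to be added to the reference FASTA file.

```python
def parse_taxonomy(lines):
    """Map sequence ID -> list of ranks, skipping blank or malformed lines."""
    entries = {}
    for line in lines:
        parts = line.split(None, 1)  # ID, then the lineage string
        if len(parts) != 2:
            continue
        seq_id, lineage = parts
        entries[seq_id] = [rank for rank in lineage.strip().split(";") if rank]
    return entries


def merge_taxonomies(reference, custom):
    """Combine two parsed taxonomies, refusing to overwrite existing IDs."""
    clashes = sorted(set(reference) & set(custom))
    if clashes:
        raise ValueError("IDs already in reference: " + ", ".join(clashes))
    merged = dict(reference)
    merged.update(custom)
    return merged


def to_lines(entries):
    """Write entries back out as tab-separated taxonomy lines."""
    return [seq_id + "\t" + ";".join(ranks) + ";" for seq_id, ranks in entries.items()]


# Hypothetical reference entry, for illustration only.
reference_lines = [
    "REF0001\tBacteria;Planctomycetes;Planctomycetia;Planctomycetales;Planctomycetaceae;Planctomyces;",
]
# Custom entries copied from the tax file above.
custom_lines = [
    "AY254882_S000404726 Bacteria;“Anammox”;AY254882;AY254882;AY254882;AY254882;",
    "AY254883_S000404727 Bacteria;“Anammox”;AY254883;AY254883;AY254883;AY254883;",
]

merged = merge_taxonomies(parse_taxonomy(reference_lines), parse_taxonomy(custom_lines))
for line in to_lines(merged):
    print(line)
```

The duplicate-ID check is there because a custom entry that silently replaced a reference entry would be hard to spot later.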

Any thoughts?

Thank you,

Colin

Hi Colin,

I think the problem is that the classifier is confused by your training set. What do the sequences get classified as? The likely cause is that relatives of these bacteria already exist in the training set (Planctomycetes in this example), so your sequences sit under a different lineage from their close relatives. I’d encourage you to make the taxonomy more specific. For example…


AY254882_S000404726 Bacteria;Planctomycetes;Planctomycetia;Candidatus_Brocadiales;Candidatus_Brocadiaceae;Candidatus_Scalindua; etc.
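If there are many entries to fix, the relabeling could be scripted. This is only a rough Python sketch: `relabel` and the `curated` table are hypothetical names, the lineage is the example one above (stopping where the example stops), and each real sequence would still need its own curated lineage.

```python
def relabel(lines, lineage_by_id):
    """Return taxonomy lines, substituting a curated lineage where one is given."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        parts = line.split(None, 1)  # ID, then the old lineage
        if not parts:
            continue  # skip blank lines
        seq_id = parts[0]
        ranks = lineage_by_id.get(seq_id)
        if ranks is None:
            out.append(line)  # no curated lineage: keep the line unchanged
        else:
            out.append(seq_id + "\t" + ";".join(ranks) + ";")
    return out


# Curated lineage taken from the example above; an assumption for this one ID.
curated = {
    "AY254882_S000404726": [
        "Bacteria",
        "Planctomycetes",
        "Planctomycetia",
        "Candidatus_Brocadiales",
        "Candidatus_Brocadiaceae",
        "Candidatus_Scalindua",
    ],
}

original = [
    "AY254882_S000404726 Bacteria;“Anammox”;AY254882;AY254882;AY254882;AY254882;",
    "AY254883_S000404727 Bacteria;“Anammox”;AY254883;AY254883;AY254883;AY254883;",
]

for line in relabel(original, curated):
    print(line)
```

Entries without a curated lineage pass through untouched, so the table can be filled in a few sequences at a time.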

Hope this helps,
Pat

Hi Pat,

Thank you for the response. I was eventually going to get to that point, but I did not want to spend the time curating the sequences for my first pass. I should have just started there, as you suggested.

Colin

Let us know how this turns out for you. One of the problems with the RDP training set is that it contains a number of poorly populated taxa, because it follows Bergey’s taxonomy somewhat slavishly. An alternative might be to try our attempt at recreating the greengenes training set.