mothur training set size vs. RDP database


I'm curious how the mothur RDP training set is generated. It contains about 10k sequences, versus over 3 million in the RDP database, and I'm wondering how one leads to the other. I've done some googling and looked through the mothur wiki and paper, but I can't seem to find the answer. Maybe I missed it.


Hi there,

Here’s how the trainset is formatted for mothur:

Keep in mind that the training set is supposed to be manually curated so that its taxonomy is correct. As I understand it, the RDP then runs the rest of its sequences through the classifier, which is trained on that curated set, to get their classifications.
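
If it helps to picture how a small curated set can label a much larger collection, here is a toy sketch in Python. It is not the RDP or mothur code: the sequences, the genus names, the function names, the 4-mer word size and the 0.5 pseudocount are all made up for illustration, and the bootstrap confidence step is left out entirely. It only shows the general idea of training on curated sequences and then classifying uncurated ones.

```python
from collections import defaultdict
import math

K = 4  # toy word size, chosen for the short example sequences (assumption)


def kmers(seq, k=K):
    """Return the set of overlapping k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(curated):
    """Count, per taxon, how many curated sequences contain each word.

    curated: list of (sequence, taxon) pairs standing in for a hand-checked
    training set.
    """
    word_counts = defaultdict(lambda: defaultdict(int))
    n_seqs = defaultdict(int)
    for seq, taxon in curated:
        n_seqs[taxon] += 1
        for word in kmers(seq):
            word_counts[taxon][word] += 1
    return word_counts, n_seqs


def classify(seq, model):
    """Assign the taxon with the highest summed log word probability.

    Uses a simplified smoothing term (0.5 pseudocount), which is an
    assumption of this sketch, not the published formula.
    """
    word_counts, n_seqs = model
    best_taxon, best_score = None, -math.inf
    for taxon, n in n_seqs.items():
        score = sum(
            math.log((word_counts[taxon].get(word, 0) + 0.5) / (n + 1))
            for word in kmers(seq)
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Hypothetical curated training set: small, but every label is trusted.
curated = [
    ("ACGTACGTACGTACGT", "GenusA"),
    ("ACGTACGAACGTACGT", "GenusA"),
    ("TTGGCCAATTGGCCAA", "GenusB"),
    ("TTGGCCAATTGGCGAA", "GenusB"),
]

# Hypothetical uncurated sequences: these get labels from the classifier.
unlabeled = ["ACGTACGTACGAACGT", "TTGGCCAATTGGCCTA"]

model = train(curated)
predictions = {seq: classify(seq, model) for seq in unlabeled}
print(predictions)
```

The point of the sketch is just the direction of the arrow: the small set is where the labels come from, and the large set is where they get applied.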