Problems with taxonomy

Hi!

I’m a bit puzzled by the taxonomy results I got from Mothur. I’ve previously used Perl scripts to get taxonomy results and only just found the taxonomy option in Mothur. It’s really easy and simple to do with Mothur, but the results from these two methods are very different :shock: . Unfortunately, earlier studies strongly suggest that the Mothur results are the incorrect ones :? . I’ve used the same database (rdp) for comparison in both methods, so what could be the problem here? It would be nice to use one and the same method for all the processing, so I’d be happy to figure out what I’m doing wrong :).

Can you give us some more information please? What is the command that you are entering? What are you comparing mothur output against? Are you using the same reference set? Same taxonomy? “How” are the results different? If you could provide data for one or two sequences that would be helpful.

I used the command classify.seqs(fasta=nimi.trim.pick.unique.fasta, template=silva.bacteria.fasta, taxonomy=silva.bacteria.rdp.tax, name=nimi.trim.names, group=nimi.groups). With the Perl scripts I use a reference database I retrieved from RDP, so I figured that I should get roughly the same results. But the percentages in each bacterial group are really different. I would expect the major groups to be Actinobacteria and Proteobacteria, but instead Mothur puts most seqs into SR1 and Bacteroidetes. In some samples the proportion of SR1 is almost 70 %.

I’ll see if I can convert the files I use with the scripts to match Mothur, and whether that makes any difference.


Here are two seqs, if that helps: ACACTGAGAGTTTGATCCTGGCTCAGAATCAACGCTGGCGGCGTGCCTAACACATGCAAGTCGCACGAGAAAGGGGAGCAATCCCTGAGTAAAGTGGCGCACGGGTGAGTAACACGTGAATCATCTACCTCCGAGTGGGGAATAACCTAGAGAAATCTGGGCTAATACCGCATAACACTTACGAGTCAAAGCAGCAATGCGCTTGGAGAGGAGTTCGCGGCCGATTAGCTAGTTGGCGGGGTAATG

ACACTGAGAGTTTGATCCTGGCTCAGAATCAACGCTGGCGGCGTGCCTAACACATGCAAGTCGCACGAGAAAGGGGAGCAATCCCTGAGTAAAGTGGCGCACGGGTGAGTAACACGTGAATCATCTACCTCCGAGTGGGGAATAACCTAGAGAAATCTGGGCTAATACCGCATAACACTTACGAGTCAAAGCAGCAATGCGCTTGGAGAGGAGTTCGCGGCCGATTAGCTAGTTGGCGGGGTAATG

Thank you :)!

Hmm… I’m not sure what’s going on.

For the first sequence here’s the output from the RDP website and from mothur…

RDP     Root[100%] Bacteria[100%] "Acidobacteria"[100%] Acidobacteria_Gp1[100%] Gp1[100%]
mothur Bacteria(100);Acidobacteria(99);Acidobacteria_Gp1(99);Gp1(99);unclassified;unclassified;unclassified;unclassified;

For the second sequence…

RDP     Root[100%] Bacteria[100%] "Acidobacteria"[100%] Acidobacteria_Gp1[100%] Gp1[100%]
mothur Bacteria(100);Acidobacteria(100);Acidobacteria_Gp1(100);Gp1(100);unclassified;unclassified;unclassified;unclassified;
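
For comparing group percentages between two pipelines, it can help to tally the classifications directly from the taxonomy strings. Below is a minimal Python sketch, assuming the strings look like the mothur lines above (semicolon-separated ranks with an optional confidence in parentheses) and that the phylum is the second rank. The function names and the `rank_index` parameter are made up for illustration; each string counts once, so unique sequences are not weighted by the names file.

```python
import re
from collections import Counter

# A taxon followed by a confidence in parentheses, e.g. "Gp1(99)".
_TAXON_WITH_SCORE = re.compile(r"^(.+)\((\d+)\)$")


def parse_mothur_taxonomy(taxonomy):
    """Split a taxonomy string into (taxon, confidence) pairs.

    Ranks without a parenthesised score (e.g. "unclassified") get None.
    """
    pairs = []
    for field in taxonomy.strip().split(";"):
        if not field:
            continue  # empty piece after the trailing semicolon
        match = _TAXON_WITH_SCORE.match(field)
        if match:
            pairs.append((match.group(1), int(match.group(2))))
        else:
            pairs.append((field, None))
    return pairs


def phylum_percentages(taxonomy_strings, rank_index=1):
    """Percentage of strings assigned to each taxon at `rank_index`.

    Assumption: index 0 is the domain and index 1 is the phylum.
    """
    counts = Counter()
    for taxonomy in taxonomy_strings:
        ranks = parse_mothur_taxonomy(taxonomy)
        name = ranks[rank_index][0] if len(ranks) > rank_index else "unclassified"
        counts[name] += 1
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


first = "Bacteria(100);Acidobacteria(99);Acidobacteria_Gp1(99);Gp1(99);unclassified;unclassified;unclassified;unclassified;"
second = "Bacteria(100);Acidobacteria(100);Acidobacteria_Gp1(100);Gp1(100);unclassified;unclassified;unclassified;unclassified;"
print(phylum_percentages([first, second]))
```

Running the same tally on the output of both methods shows which groups account for the disagreement.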

There is bound to be some level of difference between the mothur and RDP output because the training sets are different and because there’s some level of stochasticity involved in calculating the “bootstrap” values. But, I’m not seeing what you are referring to. One thing I notice is that you are using silva.bacteria.fasta as your reference dataset. When did you pull this down? There were conflicts between the alignment and taxonomy references that I have fixed and posted to the wiki. You might try these again. Alternatively, could something be wrong with your Perl script?

Thanks for your reply :). I downloaded the Silva files a few days ago, so I guess that is not the problem. I’m also pretty sure that the scripts got it right, since those results are pretty much as expected for the environment my samples come from. I’m also not expecting the problem to be in Mothur, but in the user (=me :D) . I’m trying to get files from RDP that I could use, so that I’d have matching databases for both methods. The problem is that I can’t open the files I got from Silva, so I don’t know what they should look like… Should the fasta file used as a template be aligned or something? How about the file used for taxonomy?

Hi again,

I’m still trying to make my own taxonomy-file, but just can’t get it right :evil: . I get this warning:

Reading in the rdp.tax taxonomy… S000998630 is missing a ;, please check for other errors.
S001019602 is missing a ;, please check for other errors.
S001043913 is missing a ;, please check for other errors.
S001080968 is missing a ;, please check for other errors.

And it is the same for all 600316 seqs in the file. So obviously there is something missing, but what??? I have compared my file to the silva.tax file and to me they look exactly the same. Here is what I have:

S000998630 Bacteria;Acidobacteria;Holophagae;Acanthopleuribacterales;Acanthopleuribacteraceae;Acanthopleuribacter;
S001019602 Bacteria;Acidobacteria;Holophagae;Acanthopleuribacterales;Acanthopleuribacteraceae;Acanthopleuribacter;
S001043913 Bacteria;Acidobacteria;Holophagae;Acanthopleuribacterales;Acanthopleuribacteraceae;Acanthopleuribacter;
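
When every line triggers the same warning even though it looks fine by eye, the cause is often something invisible. The Python sketch below checks each line for a few candidates: a carriage return from Windows line endings, whitespace inside or after the taxonomy string, and a missing final semicolon. It assumes the format shown above (sequence name, whitespace, semicolon-terminated taxonomy). The list of checks is a guess at what might upset the parser, not a description of what mothur actually tests, and the function names are hypothetical.

```python
def check_taxonomy_line(line):
    """Return a list of problems found in one taxonomy line (empty if none)."""
    problems = []
    if line.endswith("\r\n") or line.endswith("\r"):
        problems.append("carriage return at end of line")
    text = line.rstrip("\r\n")
    parts = text.split(None, 1)  # sequence name, then the rest of the line
    if len(parts) != 2:
        problems.append("expected a sequence name followed by a taxonomy string")
        return problems
    taxonomy = parts[1]
    if taxonomy != taxonomy.rstrip():
        problems.append("trailing whitespace after the taxonomy string")
    taxonomy = taxonomy.rstrip()
    if any(ch.isspace() for ch in taxonomy):
        problems.append("whitespace inside the taxonomy string")
    if not taxonomy.endswith(";"):
        problems.append("taxonomy string does not end with ';'")
    return problems


def check_taxonomy_file(path, max_report=10):
    """Print the first few problem lines and return how many lines had problems."""
    bad = 0
    # newline="" keeps the original line endings so carriage returns stay visible.
    with open(path, newline="") as handle:
        for number, line in enumerate(handle, start=1):
            problems = check_taxonomy_line(line)
            if problems:
                bad += 1
                if bad <= max_report:
                    print(f"line {number}: {'; '.join(problems)}")
    return bad
```

Calling `check_taxonomy_file("rdp.tax")` reads the file line by line, so it also works on files too large to open in an editor.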

Mothur still reads the file and tries to generate the search database, but then I get this:


S002178718 is missing a ;, please check for other errors.
S002178728 is missing a ;, please check for other errors.
S002178832 is missing a ;, please check for other errors.
DONE.
Generating search database… Error: St9bad_alloc has occurred in the KmerDB class function addSequence. Please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry.

The command I’m using is:
mothur > classify.seqs(fasta=episeqs.trim.unique.fasta, template=rdp.fasta, taxonomy=rdp.tax, name=episeqs.trim.names, group=episeqs.groups)

Please help :cry: !

mothur creates shortcut files for the taxonomy files to save time. If you modified the taxonomy file, but have not removed the .tree.train, .tree.sum, .prob and .numNotzero files associated with your taxonomy file, mothur may be reading those instead of remaking them.
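
A small Python sketch for finding (and optionally deleting) those shortcut files is below. The suffix list is copied from the sentence above and is an assumption: the exact file names may differ between mothur versions, so check the dry-run listing before deleting anything. The function names are made up for illustration.

```python
from pathlib import Path

# Suffixes as named in the reply above; adjust if your mothur version differs.
SHORTCUT_SUFFIXES = (".tree.train", ".tree.sum", ".prob", ".numNotzero")


def find_shortcut_files(directory, suffixes=SHORTCUT_SUFFIXES):
    """Return files in `directory` whose names end with one of the suffixes."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(tuple(suffixes))
    )


def remove_shortcut_files(directory, dry_run=True):
    """List matching files; delete them only when dry_run is False."""
    matched = find_shortcut_files(directory)
    for path in matched:
        if not dry_run:
            path.unlink()
    return matched


if __name__ == "__main__":
    # Dry run: only prints what would be removed from the current directory.
    for path in remove_shortcut_files(".", dry_run=True):
        print("would remove", path)
```

After checking the list, `remove_shortcut_files(".", dry_run=False)` deletes the files so that mothur has to rebuild them on the next run.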

Thanks for the suggestion :slight_smile: , but unfortunately that didn’t help.

Is it correct that, to complete this command, Mothur needs an aligned template file (fasta), a taxonomy file (tax) and also a matching unaligned fasta file? I think the problem could be that the sequences I got from RDP have names after the accession number of each sequence. I can remove the names from the unaligned file, but the aligned file is SOOOO huge that I can’t open it on my computer :shock: . Can you tell whether those names in the aligned file are the problem?

I have now checked that the Silva reference that I used with Mothur has ~ 15 000 seqs while my own reference has 270 000 seqs, so that might explain the differences…?
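
A file that is too large to open in an editor can still be processed one line at a time. The Python sketch below copies a FASTA file and keeps only the first whitespace-delimited word of each header line, which is assumed here to be the accession number. Whether the extra names are what actually breaks the run is not established. The function name and the file names in the usage note are hypothetical.

```python
def strip_fasta_descriptions(in_path, out_path):
    """Copy a FASTA file, keeping only the first word of each header line.

    Reads one line at a time, so memory use stays small for huge files.
    Returns the number of header lines (sequences) seen.
    """
    n_headers = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                dst.write(">" + name + "\n")
                n_headers += 1
            else:
                dst.write(line)  # sequence lines are copied unchanged
    return n_headers
```

For example, `strip_fasta_descriptions("rdp_download.fasta", "rdp.fasta")` writes the cleaned copy and returns the sequence count, which is also a quick way to compare the sizes of two reference sets.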

The only search method that requires an aligned version is the distance-based k-nearest neighbor approach. So if you’re using the bayesian approach, you don’t need to use aligned sequences. Also, have you put semicolons at the end of each line in the taxonomy file as the error suggests?

Yes, adding ; to the end was the first thing I did. Still no luck completing classification. I’ll keep trying later…