mothur crashes using classify.seqs and UNITE database

Hello, I am running into a problem with a dataset of ITS sequences. I ran them through the MiSeq SOP (make.contigs, screen.seqs, unique.seqs, count.seqs, pre.cluster, chimera.uchime, remove.seqs) without aligning them, using needleman during the pre.cluster step. I have now reached the point where I would like to run classify.seqs with the UNITE database.

During the first attempts mothur crashed. I noticed that mothur began to use a lot of RAM. Has anyone experienced this problem with UNITE?
I also ran a larger 16S dataset following the MiSeq SOP (aligning to SILVA), and those sequences were classified without such extreme memory consumption. Could mothur be crashing with UNITE because the ITS sequences are not aligned? I do not understand why that would matter, because the reference files for 16S are also not aligned.

Best wishes, guido

Hm, I’ve run classify.seqs with the UNITE species hypothesis database from here and it worked fine, although I had to format the taxonomy file a bit. Can you paste a few lines from your database fasta and tax files? It could just be a formatting error.

You’re correct that the files don’t need to be aligned, so that won’t be the problem.

Thanks for your reply. You were right, it was a formatting error. I had modified the dataset by adding a few extra sequences. I have now tried it with the original dataset, and that one works.

Dear All,
I too am encountering a similar problem: mothur crashes while running classify.seqs with the UNITE database. I saw that the problem was solved by formatting the taxonomy file, but how do I format a taxonomy file? It would be great if somebody could help me with this :(

Regards,

Hema

The format for the taxonomy file is just a simple text file where each line has the format:

SequenceID[\t]Domain;Phylum;Class;Order;Family;Genus;Species

That should be a tab separating the SequenceID from the taxonomy. You can have as few or as many taxonomic ranks as you like in the taxonomy string, but every sequence must have the same number of ranks, so if you add subclasses or superfamilies to one sequence, you need to account for this in all the other ones. The DNA sequences should just be in a standard fasta file, with SequenceIDs that match those in the taxonomy file.
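
If it helps, here is a rough Python sketch of how you could check those three rules (tab separator, same number of ranks on every line, matching SequenceIDs) before running classify.seqs. It is not part of mothur; the function names are made up for illustration, and it assumes the SequenceID in the fasta file is everything between ">" and the first whitespace, and that a trailing semicolon on the taxonomy string should be ignored when counting ranks.

```python
from collections import Counter


def read_fasta_ids(lines):
    """Collect sequence IDs from fasta lines (assumed: text after '>' up to the first whitespace)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.append(header[0])
    return ids


def check_taxonomy(tax_lines, fasta_ids):
    """Return a list of problems found in the taxonomy lines (hypothetical helper, not a mothur command)."""
    problems = []
    depths = {}  # SequenceID -> number of ranks in its taxonomy string
    for number, raw in enumerate(tax_lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        seq_id, tab, taxonomy = line.partition("\t")
        if not tab or not seq_id.strip() or not taxonomy.strip():
            problems.append(f"line {number}: expected SequenceID<TAB>taxonomy")
            continue
        if seq_id in depths:
            problems.append(f"line {number}: duplicate SequenceID {seq_id}")
            continue
        # Ignore one or more trailing semicolons when counting ranks.
        ranks = taxonomy.strip().rstrip(";").split(";")
        if any(not rank.strip() for rank in ranks):
            problems.append(f"line {number}: empty rank for {seq_id}")
        depths[seq_id] = len(ranks)

    if depths:
        # Treat the most common rank count as the expected one.
        expected = Counter(depths.values()).most_common(1)[0][0]
        for seq_id, depth in depths.items():
            if depth != expected:
                problems.append(f"{seq_id}: {depth} ranks, expected {expected}")

    fasta_set = set(fasta_ids)
    tax_set = set(depths)
    for seq_id in sorted(fasta_set - tax_set):
        problems.append(f"{seq_id}: in fasta but not in taxonomy")
    for seq_id in sorted(tax_set - fasta_set):
        problems.append(f"{seq_id}: in taxonomy but not in fasta")
    return problems
```

You would open both files, pass the fasta lines to read_fasta_ids and the taxonomy lines plus the resulting IDs to check_taxonomy, and print whatever comes back. An empty list only means that these particular checks passed, not that mothur will accept the files.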