remove.lineage output files not in sync

I have seen that there has been a thread about this in 2012, but jenz seemed to have different problems and error messages than I do.

I am following the SOP closely, except that I did not remove my Mock sample (I analyze it like the other samples) and I use the greengenes files for classification (but not for alignment; there I stick to the silva database).

So after I have run classify.seqs, everything seems to be OK, and then I run remove.lineage, and it still seems to be OK but in fact it is not - the output files are not in sync.

The first sign of trouble was that summary.seqs gives me no information about unique sequences; that line is simply missing from the summary.

And then I try to run cluster.split and I get an error message saying that a sequence is missing from my fasta file. “This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate and error.” When I had a look into the fasta, count and taxonomy files (both before and after running cluster.split), I could see that this particular sequence is there in all the before-files, and it is removed from the fasta file, but it is not removed from the taxonomy and the count files.
The sequence is classified as:
k__Bacteria(99);p__Bacteroidetes(85);unclassified;unclassified;unclassified; unclas

What’s wrong?
Did I mess up something?
How can I solve this problem?

Thanks a lot,

Does your count file contain group information or just raw counts? For example, are the lines like this:

Representative_Sequence total F003D000 F003D002 …
GQY1XT001C296C 6020 407 980 …
GQY1XT001CJLAE 491 341 41 …


or like this:

Representative_Sequence total
GQY1XT001C296C 6020
GQY1XT001CJLAE 491
...

The second type of count table has a bug with the remove.lineage command that will be fixed in our next release. There are two workarounds. You can use a names file instead of a count file, or you can assign all your sequences to one group with make.group (…, group=A) and then build the table from the resulting group file:
make.table(name=yourNameFile, group=groupFileCreatedByMakeGroup)
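
If it helps to tell the two count table layouts apart without opening the file by eye, here is a minimal Python sketch. It assumes the header line is whitespace-delimited as in the examples above; `has_group_columns` is just an illustrative helper name, not a mothur function.

```python
def has_group_columns(header_line):
    """Return True if a count table header lists groups after 'total'.

    Assumption: the header starts with 'Representative_Sequence' and
    'total', and any further columns are group (sample) names.
    """
    fields = header_line.split()
    return len(fields) > 2


# First layout: per-group counts follow the total.
print(has_group_columns("Representative_Sequence total F003D000 F003D002"))
# Second layout: raw counts only.
print(has_group_columns("Representative_Sequence total"))
```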

thank you for the fast reply!

My count table looks like your first example; the group information is actually the IDs of my samples (R1E-0603, R1E-0610, etc.).
Could the problem be that I have hyphens (-) in my sample IDs?

Since the group names are my sample IDs I guess assigning all sequences to one group is not an option for me. Or is it? Am I going to be able to identify my samples if I assign everything to one group?

About the names file: I have only one names file, from the beginning of the analysis. It’s called “stability.trim.contigs.good.names” (I usually don’t bother to rename anything). But at this point the files usually are “good” and “unique” and “pick” and a number of other things, so I thought that old names file is probably not going to match my fasta and tax files. I have tried it anyway, and the remove.lineage seemed to work, but then with cluster.split I ended up with an endless row of error messages saying “error in reading your fastafile, at position -1. Blank name”. Can I somehow create a names file that matches the other 2 files?

Thank you for your help,

Mothur can be picky about the ‘-’ character in group names. This will be especially problematic if you try to select a subset of the groups later on. Have you tried changing the ‘-’ to ‘_’? Alternatively, could you be accidentally giving mothur the wrong filename? The filenames can get lengthy, and a common mistake we see with this type of error is that someone gives a valid filename that comes from a screening step earlier in the pipeline. To oversimplify: final…fasta with final…pick.names.
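
For the hyphen suggestion, a rough Python sketch of renaming the groups in a count table header is below. It assumes a tab-delimited header whose first two columns are Representative_Sequence and total, with group names after that; `sanitize_count_header` is a made-up helper name. Any other file that names the groups would need the same change, and it is not established that the hyphens are what caused the problem in this case.

```python
def sanitize_count_header(header_line):
    """Replace '-' with '_' in the group names of a count table header.

    Assumption: tab-delimited header, first two columns are
    'Representative_Sequence' and 'total', the rest are group names.
    """
    fields = header_line.rstrip("\n").split("\t")
    fixed = fields[:2] + [name.replace("-", "_") for name in fields[2:]]
    return "\t".join(fixed)


header = "Representative_Sequence\ttotal\tR1E-0603\tR1E-0610\n"
print(sanitize_count_header(header))
```

Only the header line changes; the count rows below it can be copied through untouched.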

I’m not using the actual filenames; I’m using “current” (mainly because I’m too lazy to type those long names).
I have manually removed the problematic sequence from the tax and group files. It’s not a pro solution, but it works, so for the time being I’m happy with it :)
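
For anyone who wants to find such leftover sequences without searching by hand, here is a minimal Python sketch that lists names present in the count or taxonomy file but absent from the fasta file. It assumes the count table has a single header line and that the sequence name is the first whitespace-delimited field on each line; the function names and the sample data (seqA, seqB, seqC) are hypothetical.

```python
def fasta_names(lines):
    """Collect sequence names from the '>' header lines of a fasta file."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def first_column_names(lines, skip_header=False):
    """Collect the first field of each non-empty line."""
    names = set()
    for i, line in enumerate(lines):
        if skip_header and i == 0:
            continue
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def out_of_sync(fasta, count, taxonomy):
    """Report names that the count or taxonomy file has but the fasta lacks."""
    in_fasta = fasta_names(fasta)
    return {
        "count_only": first_column_names(count, skip_header=True) - in_fasta,
        "taxonomy_only": first_column_names(taxonomy) - in_fasta,
    }


# Made-up example: seqC was dropped from the fasta but not from the others.
fasta = [">seqA", "ACGT", ">seqB", "ACGT"]
count = ["Representative_Sequence\ttotal\tR1E_0603", "seqA\t3\t3",
         "seqB\t1\t1", "seqC\t2\t2"]
taxonomy = ["seqA\tk__Bacteria(99);", "seqC\tk__Bacteria(99);"]
print(out_of_sync(fasta, count, taxonomy))
```

With real files, each argument can be an open file handle, since the functions only iterate over lines.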
In this particular dataset, I don’t need subsets (lucky me), but in my next analysis (and in the future in general) I will make sure to avoid hyphens in the file names.
Thank you for your help!

Thanks for bringing this to our attention. I will add a feature request to change any ‘-’ characters found in sequence names to be ‘_’ characters.