remove.lineage output files not in sync

enikos · October 14, 2014, 11:44am

Hi,
I have seen that there has been a thread about this in 2012, but jenz seemed to have different problems and error messages than I do.

I follow the SOP slavishly, except that I did not remove my Mock sample (I analyze it like the other samples), and I use the greengenes files for classification (but not for alignment; there I stick to the silva database).

So after I have run classify.seqs, everything seems to be OK, and then I run remove.lineage, and it still seems to be OK but in fact it is not - the output files are not in sync.

The first sign of problems was that when I run summary.seqs I don’t get information about unique sequences, that line is simply missing from the summary.

And then I try to run cluster.split and I get an error message saying that a sequence is missing from my fasta file. “This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate and error.” When I had a look into the fasta, count and taxonomy files (both before and after running cluster.split), I could see that this particular sequence is there in all the before-files, and it is removed from the fasta file, but it is not removed from the taxonomy and the count files.
The sequence is classified as:
k__Bacteria(99);p__Bacteroidetes(85);unclassified;unclassified;unclassified; unclas

What’s wrong?
Did I mess up something?
How can I solve this problem?

Thanks a lot,
Enikö

westcott · October 14, 2014, 7:05pm

Does your count file contain group information or just raw counts? For example, are the lines like this:

Representative_Sequence total F003D000 F003D002 …
GQY1XT001C296C 6020 407 980 …
GQY1XT001CJLAE 491 341 41 …
…

or

Representative_Sequence total
GQY1XT001C296C 6020
GQY1XT001CJLAE 491
...

The second type of count table has a bug with the remove.lineage command that will be fixed in our next release. There are 2 workarounds you can use. You can use a names file instead of a count file or you can assign all your sequences to one group using the make.group command.

make.group(fasta=yourFastaFile, groups=A)
make.table(name=yourNameFile, group=groupFileCreatedByMakeGroup)
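To tell which of the two layouts a count table has, it can help to look only at its header line. The following is a minimal Python sketch, not part of mothur; the function name `count_table_has_groups` is hypothetical, and it assumes the header is the first line and consists of the sequence-name label, `total`, and then zero or more group names, as in the examples above.

```python
def count_table_has_groups(header_line: str) -> bool:
    """Return True if the header lists group columns after 'total'.

    Assumed layout (taken from the examples in this thread):
    Representative_Sequence, total, then zero or more group names.
    """
    columns = header_line.split()
    # Two columns means raw counts only; more than two means per-group counts.
    return len(columns) > 2


if __name__ == "__main__":
    print(count_table_has_groups("Representative_Sequence total F003D000 F003D002"))
    print(count_table_has_groups("Representative_Sequence total"))
```

If the function reports no group columns, the table is of the second type described above.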

enikos · October 15, 2014, 5:48am

Hi,
thank you for the fast reply!

My count table looks like your first example; the group information is actually the IDs of my samples (R1E-0603, R1E-0610, etc.).
Could the problem be that I have hyphens (-) in my sample IDs?

Since the group names are my sample IDs I guess assigning all sequences to one group is not an option for me. Or is it? Am I going to be able to identify my samples if I assign everything to one group?

About the names file: I have only one names file, from the beginning of the analysis. It’s called “stability.trim.contigs.good.names” (I usually don’t bother to rename anything). But at this point the files are usually “good” and “unique” and “pick” and a number of other things, so I thought that the old names file is probably not going to match my fasta and tax files. I tried it anyway, and remove.lineage seemed to work, but then with cluster.split I ended up with an endless row of error messages saying “error in reading your fastafile, at position -1. Blank name”. Can I somehow create a names file that matches the other 2 files?

Thank you for your help,
Enikö

westcott · October 16, 2014, 1:57pm

Mothur can be picky about the ‘-’ character in group names. This will be especially problematic if you try to select a subset of the groups later on. Have you tried changing the ‘-’ to ‘_’? Alternatively, could you be accidentally giving mothur the wrong filename? The filenames can get lengthy, and a common mistake we see when this type of error occurs is that someone gives a valid filename that comes from a screening step earlier in the pipeline. To oversimplify: final…fasta with final…pick.names.
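One way to change the ‘-’ to ‘_’ in the group names of a count table is to rewrite only its header line. This is a minimal Python sketch, not a mothur command; the function name `sanitize_count_header` is hypothetical, and it assumes a tab-separated header whose first two columns are the sequence-name label and `total`, followed by the group names.

```python
def sanitize_count_header(header_line: str) -> str:
    """Replace '-' with '_' in the group names of a count table header.

    Assumes a tab-separated header: name label, 'total', then group names.
    Sequence names in the data rows are deliberately left untouched.
    """
    columns = header_line.rstrip("\n").split("\t")
    # Only the columns after 'total' are group names.
    fixed = columns[:2] + [name.replace("-", "_") for name in columns[2:]]
    return "\t".join(fixed)


if __name__ == "__main__":
    header = "Representative_Sequence\ttotal\tR1E-0603\tR1E-0610"
    print(sanitize_count_header(header))
```

Any other file that carries the same group names would need the same renaming so that the names still match.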

enikos · October 17, 2014, 8:14am

I’m not using the actual filenames, I’m using “current” (mainly because I’m too lazy to type those long names).
I have manually removed the problematic sequence from the tax and group files. It’s not a pro solution, but it works, so for the time being I’m happy with it.
In this particular dataset I don’t need subsets (lucky me), but in my next analysis (and in the future in general) I will make sure to avoid hyphens in the file names.
Thank you for your help!
Enikö
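The manual check described above, finding sequences that are still listed in the taxonomy or count file but are gone from the fasta file, can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sequence name is taken to be the first whitespace-delimited token of each fasta header and of each table row, and the count table is assumed to have a single header line while the taxonomy file has none.

```python
def fasta_names(lines):
    """Collect sequence names from fasta header lines (those starting with '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip blank headers
                names.add(tokens[0])
    return names


def table_names(lines, has_header):
    """Collect the first column of a whitespace-delimited table."""
    rows = [line for line in lines if line.strip()]
    if has_header:
        rows = rows[1:]
    return {row.split()[0] for row in rows}


def out_of_sync(fasta_lines, taxonomy_lines, count_lines):
    """Report names present in the taxonomy or count file but absent from the fasta."""
    in_fasta = fasta_names(fasta_lines)
    return {
        "taxonomy_only": table_names(taxonomy_lines, has_header=False) - in_fasta,
        "count_only": table_names(count_lines, has_header=True) - in_fasta,
    }


if __name__ == "__main__":
    # Made-up example data; seqA and seqB are placeholder names.
    fasta = [">seqA\n", "ACGT\n"]
    taxonomy = ["seqA\tk__Bacteria(99);\n", "seqB\tk__Bacteria(99);\n"]
    count = ["Representative_Sequence\ttotal\tg1\n", "seqA\t3\t3\n", "seqB\t1\t1\n"]
    print(out_of_sync(fasta, taxonomy, count))
```

With real data, open file handles can be passed in place of the lists, since iterating over a text file yields its lines.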

westcott · January 28, 2016, 7:44pm

Thanks for bringing this to our attention. I will add a feature request to change any ‘-’ characters found in sequence names to be ‘_’ characters.
