Remove.lineage files not in synch (tax,group)

I completed an anlaysis, that used more or less the SOP from mothur wiki.(Didn’t have the remove contaminants step, few other mods)
Then as I saw that the taxonomy contained a lot of sequences that were unclassified to the highest levels of taxonomy - I
decided to run again an analysis, but with these sequences removed.
I started from the point of SOP .final. files. Just in case I ran the classify.seqs command again.
So …
classify.seqs(fasta=CNP.final.fasta, template=trainset6_032010.rdp.fasta, taxonomy=trainset6_032010.rdp.tax, cutoff=80, processors=7)
worked fine. Output File Names: CNP.final.rdp.taxonomy CNP.final.rdp.tax.summary
then …
remove.lineage(taxonomy=CNP.final.rdp.taxonomy, name=CNP.final.names, group=CNP.final.groups, fasta=CNP.final.fasta, taxon=unclassified;-Bacteria;unclassified;, dups=T)
Seemed to work fine also. Output file names CNP.final.rdp.pick.taxonomy CNP.final.pick.names CNP.final.pick.fasta CNP.proovikaupa.pick.groups,
I renamed these files to format CNPU.final*, and continued with SOP. It worked fine until I had to use…
classify.otu(list=CNPU.final.an.list, name=CNPU.final.names, taxonomy=CNPU.final.taxonomy, label=0.03, cutoff=80)
It finished but with a lot of error messages …
X is not in your taxonomy file. I will not include it in the consensus.
X is represented by Y and is not in your taxonomy file. I will not include it in the consensus.
And then next…
make.shared(list=CNPU.final.tx.list, group=CNPU.final.groups, label=1)
Caused error messages - “[ERROR]: X is in your groupfile and not your listfile.” and “Please correct.Your group file contains 720345 sequences and list file contains 718362 sequences. Please correct. For a list of names that are in your group file and not in your list file, please refer to CNPU.final.tx.missing.name.”
It also seemed to freeze mothur. I had to kill the process.

Shouldn’t the remove.lineage command when done with all the fasta,names,group,taxonomy files options given keep all those files in synch?
Where did this imbalance between them creep in and how can I fix it?

It also seems weird that the wiki SOP would actually continue just as I did this time (as I mentioned on my first run I did no remove.lineage commands), should I be expecting these kind of trouble when I run SOP as written? When not then what might be the difference there?

PS. If needed I will post the entire (luckily) short run of commands (like a batch) in a post or in a link (maybe also link log-files), but I hope that maybe someone can answer and help me without me posting something so gangly up here.

Could you post the commands you ran from just before classify.seqs through classify.otu?

CN.final* files and CN.proovikaupa.groups were the final* files from an analysis I successfully ran before.

(In this analysis the group file was generated first out of .trim.fasta file by a house script (couldn’t use oligos file
because we had double barcode coding on both primers etc.)
As sequence names in our fasta file ended with _22A (Sample ID,number for parient, letter for treatment), then all this script had
to do was to remove this last 22A and write out a group file. Seq/t22A etc.)

For the next analysis I wanted to remove some groups and then these unclassified lineage.
I used following commands.

(The in house perls script ./gr_suureks converts a per-sample allocated group file, to a per-treatment allocated
group file by writing out sequence and just the ending letter of sampleID, seq/tA etc.)

remove.groups(fasta=CN.final.fasta, name=CN.final.names, group=CN.proovikaupa.groups, groups=11A-13A-11B-13B)
system(cp CN.final.pick.names CNP.final.names)
system(cp CN.final.pick.fasta CNP.final.fasta)
system(cp CN.proovikaupa.pick.groups CNP.proovikaupa.groups)
system(./gr_suureks CNP.proovikaupa.groups > CNP.final.groups)

After this I did the classify and remove.And converted the names to a shorter form.

classify.seqs(fasta=CNP.final.fasta, template=trainset6_032010.rdp.fasta, taxonomy=trainset6_032010.rdp.tax, cutoff=80, processors=7)

remove.lineage(taxonomy=CNP.final.rdp.taxonomy, name=CNP.final.names, group=CNP.final.groups, fasta=CNP.final.fasta, taxon=unclassified;-Bacteria;unclassified;, dups=T)

system(mv CNP.final.rdp.pick.taxonomy CNPU.final.taxonomy)
system(mv CNP.final.pick.names CNPU.final.names)
system(mv CNP.final.pick.fasta CNPU.final.fasta)
system(mv CNP.final.pick.groups CNPU.final.groups)

dist.seqs(fasta=CNPU.final.fasta, cutoff=0.15, processors=7)
cluster(column=CNPU.final.dist, name=CNPU.final.names)

make.shared(list=CNPU.final.an.list, group=CNPU.final.groups, label=0.03)
count.groups()

sub.sample(shared=CNPU.final.an.shared, size=321649)

classify.otu(list=CNPU.final.an.list, name=CNPU.final.names, taxonomy=CNPU.final.taxonomy, label=0.03, cutoff=80)

phylotype(taxonomy=CNPU.final.taxonomy, name=CNPU.final.names, label=1)
make.shared(list=CNPU.final.tx.list, group=CNPU.final.groups, label=1)
sub.sample(shared=CNPU.final.tx.shared, size=321649)
classify.otu(list=CNPU.final.tx.list, name=CNPU.final.names, taxonomy=CNPU.final.taxonomy, label=1)

(Wont write more of it right now. The troubles were in places I mentioned in previous post. I actually ran this two times - once I ran it
with a per-sample group file and this time in here for a per-treatment group file. The problematic commands were the same.

We found a bug in the remove.groups command that was causing file mismatch errors. It has been fixed in our current version. I suspect that is the source of the problem. I would suggest upgrading to 1.23.0 and rerunning from the remove.groups command on.

OK. Good to know. I’ll update my Mothur then. Thank you.

Ran an analysis with a new Mothur version. Seems there were some other issues than this probable Remove.groups issue. I will post the whole of my commands in the next post with comments/questions typed in and colored blue .

trim.seqs(fasta=CN.fa,qfile=CN.quala, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=7)
system(./grupp_CN.pl CN.trim.fasta > CN.groups)
unique.seqs(fasta=CN.trim.fasta)
align.seqs(fasta=CN.trim.unique.fasta, reference=gg.ref.fasta, processors=7,flip=T)
summary.seqs(fasta=CN.trim.unique.align)
screen.seqs(fasta=CN.trim.unique.align, name=CN.trim.names, group=CN.groups, end=4951,start=4655,minlength=65,processors=7)
summary.seqs(fasta=current)
filter.seqs(fasta=CN.trim.unique.good.align, vertical=T, trump=., processors=7)
unique.seqs(fasta=CN.trim.unique.good.filter.fasta, name=CN.trim.good.names)
pre.cluster(fasta=CN.trim.unique.good.filter.unique.fasta, name=CN.trim.unique.good.filter.names, group=CN.good.groups, diffs=1)
chimera.uchime(fasta=CN.trim.unique.good.filter.unique.precluster.fasta, name=CN.trim.unique.good.filter.unique.precluster.names, group=CN.good.groups, processors=7)
remove.seqs(accnos=CN.trim.unique.good.filter.unique.precluster.uchime.accnos, fasta=CN.trim.unique.good.filter.unique.precluster.fasta, name=CN.trim.unique.good.filter.unique.precluster.names, group=CN.good.groups)
system(mv CN.trim.unique.good.filter.unique.precluster.pick.names CN.final.names)
system(mv CN.trim.unique.good.filter.unique.precluster.pick.fasta CN.final.fasta)
system(mv CN.good.pick.groups CN.final.groups)
system(./gr_puhas CN.final.groups > CN.proovikaupa.groups)
remove.groups(fasta=CN.final.fasta, name=CN.final.names, group=CN.proovikaupa.groups, groups=11A-13A-11B-13B)
classify.seqs(fasta=CN.final.pick.fasta,template=gg_99.pds.ng.fasta, taxonomy=gg_99.pds.tax, cutoff=80, processors=7, method=knn)
remove.lineage(taxonomy=CN.final.pick.pds.taxonomy, name=CN.final.pick.names, group=CN.proovikaupa.pick.groups, fasta=CN.final.pick.fasta, taxon=k__Archaea;unclassified;-k__Bacteria;unclassified;, dups=T)
system(mv CN.final.pick.pick.names CNP.final.names)
system(mv CN.final.pick.pick.fasta CNP.final.fasta)
system(mv CN.proovikaupa.pick.pick.groups CNP.proovikaupa.groups)
system(mv CN.final.pick.pds.pick.taxonomy CNP.final.taxonomy)
system(./gr_suureks CNP.proovikaupa.groups > CNP.final.groups)
dist.seqs(fasta=CNP.final.fasta, cutoff=0.10, processors=7)
cluster(column=CNP.final.dist, name=CNP.final.names,method=furthest)
make.shared(list=CNP.final.fn.list, group=CNP.proovikaupa.groups, label=0.05)
count.groups()
sub.sample(shared=CNP.final.fn.shared, size=8326)
#everything fine and dandy to this point (only few notifications, at classify.seqs > There are no common levels for sequence #DBNW5DQ1:84:B04B8ABXX:5:1101:10800:19470:1:N:0:_80bp_80.0_0.86_CN_6A. #DBNW5DQ1:84:B04B8ABXX:5:1101:10800:19470:1:N:0:_80bp_80.0_0.86_CN_6A will be #disregarded.)
#Might this be the source of problems?
#Now the next command(classify) will produce these error messages.
#DBNW5DQ1:84:B04B8ABXX:5:1101:10800:19470:1:N:0:_80bp_80.0_0.86_CN_6A is not in your taxonomy file. I will not include it in the consensus.
#Warning: cannot find taxon no_consensus in reference taxonomy tree at level 0. This may cause totals of daughter levels not to add up in summary file.
#DBNW5DQ1:84:B04B8ABXX:5:1101:10800:19470:1:N:0:_80bp_80.0_0.86_CN_6A is represented by #DBNW5DQ1:84:B04B8ABXX:5:1101:9982:18199:1:N:0:_80bp_80.0_0.86_CN_6A #and is not in your taxonomy file. I will not include it in the consensus.
#so yeah this might mbe the source of my problems
classify.otu(list=CNP.final.fn.list, name=CNP.final.names, taxonomy=CNP.final.taxonomy, label=0.05, cutoff=80, reftaxonomy=gg_99.pds.tax)
#next command(make.shared(list=CNP.final.tx.list, group=CNP.proovikaupa.groups, label=1)) will cause error messages like
#[ERROR]: DBNW5DQ1:84:B04B8ABXX:5:2208:9429:200091:1:N:0:_80bp_80.0_0.86_CN_14B is in your groupfile and not your listfile. Please correct.
#Your group file contains 4021388 sequences and list file contains 4020773 sequences. Please correct.
#For a list of names that are in your group file and not in your list file, please refer to CNP.final.tx.missing.name.
#it also seems to hang up(mothur using processor but doesn’t seem to get anywhere for an hour or so, screen stays on the last lines of these notifications
#(please refer to CNP.final.tx.missing.name etc.) though no missing or other new files are created.)
#Is it the same problem caused by sequences that were disregarded at the classify.seqs command?
make.shared(list=CNP.final.tx.list, group=CNP.proovikaupa.groups, label=1)

No idea what these two lines do

system(./gr_puhas CN.final.groups > CN.proovikaupa.groups)
system(./gr_suureks CNP.proovikaupa.groups > CNP.final.groups)

But 2 lines after the second system call

make.shared(list=CNP.final.fn.list, group=CNP.proovikaupa.groups, label=0.05)

you use proovikaupa instead of final, is that correct?

jenz,

Sorry about this. It looks like you have some sequences that probably classify as across kingdoms and so when the method=knn goes for the consensus it is unable to make a consensus at the kingdom level and throws a hissy fit. We’re working on a software solution to this. Would you mind doing me a little favor? Since you seem to have some perl skills, could you modify the gg_99.pds.tax file by replacing all of the “\t” characters with “\tRoot;” and rerunning the classify.seqs command? I’m pretty sure this should solve the problem. If this works, I’ll make sure that all of the taxonomy files have a Root level that is the same across all sequences.

Pat

This seems to have solved the problem. I was able to complete sop without error messages/notifications or crashes. ( I have to still look inside the produced files and analyse results to see if there might me something amiss - but at the first sight its good.)
So I’ll guess I have to add this extra level into gg and rdp.tax files for future analyses.

Thanks for the update!