namefile and groupfile mismatch

eeloe · February 19, 2012, 3:23pm

Hi there,

I have been having issues with the namefile and groupfile not matching after a set of merge, screen, filter, and unique commands using mothur-1.23. I’ve tried a couple of different tests to figure out where the missing sequences went, but I’m at a loss right now. Here are the commands that I’ve used and the error that comes up:

merge.files(input=cfl25.subset.align-salm.subset.align, output=salm.all.subset.align)
merge.files(input=cfl25.subset.groups-salm.subset.groups, output=salm.all.subset.groups)
merge.files(input=cfl25.subset.names-salm.subset.names, output=salm.all.subset.names)
screen.seqs(fasta=salm.all.subset.align, name=salm.all.subset.names, group=salm.all.subset.groups, optimize=start-end, criteria=99, processors=24)
filter.seqs(fasta=current, vertical=T, trump=.,processors=24)
unique.seqs(fasta=current, name=current)
pre.cluster(fasta=current, name=current, group=current, diffs=2)

mothur > pre.cluster(fasta=salm.all.subset.good.filter.unique.fasta, name=salm.all.subset.good.filter.names, group=salm.all.subset.good.groups, diffs=2)

[ERROR]: Your name file contains 763839 valid sequences, and your groupfile contains 1162922, please correct.

Somewhere around the filter step, a bunch of sequences go missing. This is a pretty large dataset combining multiple plates. Please let me know if there is additional information I can send you guys. Thanks very much for your help!

Best,
Emiley

westcott · February 20, 2012, 1:05pm

Does mothur output any missing names? Could any of your sequences have the same name? If you send your files to mothur.bugs@gmail.com, I can try and track down the problem for you.

eeloe · February 20, 2012, 2:16pm

It’s only after the make.shared command that it outputs a list of the missing sequences (it would be a nice function to have the same output of missing sequences in the pre.cluster command as the make.shared command…).

I don’t think that there are duplicate sequence names… at least not that I’ve been able to track down…
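For anyone who wants the list of mismatched names before reaching make.shared, here is a minimal Python sketch that compares a name file against a group file outside of mothur. It assumes the usual layouts (name file: representative, then a comma-separated list of names; group file: name, then group). The function names and the toy data are made up for illustration and are not part of mothur.

```python
from collections import Counter


def names_in_name_file(lines):
    """Return every sequence name listed in column 2 of a name file."""
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        names.extend(n for n in fields[1].split(",") if n)
    return names


def names_in_group_file(lines):
    """Return the sequence name (column 1) of every group file line."""
    return [line.split()[0] for line in lines if line.strip()]


def compare(name_lines, group_lines):
    """Report names found in only one file, plus names listed more than once."""
    in_names = names_in_name_file(name_lines)
    in_groups = names_in_group_file(group_lines)
    return {
        "only_in_groups": sorted(set(in_groups) - set(in_names)),
        "only_in_names": sorted(set(in_names) - set(in_groups)),
        "duplicates_in_names": sorted(
            n for n, c in Counter(in_names).items() if c > 1
        ),
        "duplicates_in_groups": sorted(
            n for n, c in Counter(in_groups).items() if c > 1
        ),
    }


# Toy data (hypothetical): seq2 is in the group file but not the name file.
toy_names = ["seq1\tseq1,seq3", "seq5\tseq5"]
toy_groups = ["seq1\tA", "seq2\tB", "seq3\tA", "seq5\tA"]
report = compare(toy_names, toy_groups)
print(report["only_in_groups"])
```

With real data, open file handles can be passed in place of the toy lists, for example `compare(open("salm.all.subset.good.filter.names"), open("salm.all.subset.good.groups"))`. Everything is held in memory, so very large files may need a machine with enough RAM.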

The files are really big - is there another way for me to send you the files? And which files specifically do you need - thanks so much for your help. I really really appreciate it!

Thanks,
Emiley

westcott · February 20, 2012, 6:33pm

How did you create the original files?

Using wc and mothur’s count.seqs command, I can see that the names and groups files have the same number of sequences for both datasets.

wc -l salm.subset.groups
968999 salm.subset.groups
mothur > count.seqs(name=salm.subset.names)
Total number of sequences: 968999

wc -l cfl25.subset.groups
200824 cfl25.subset.groups
mothur > count.seqs(name=cfl25.subset.names)
Total number of sequences: 200824

But the fasta files are missing some unique sequences:

wc -l cfl25.subset.align
91658 cfl25.subset.align
wc -l cfl25.subset.names
49919 cfl25.subset.names

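Each sequence takes at least two lines in the fasta file (a header line and the sequence), so the fasta file should have at least twice as many lines as the name file. But 49919*2 = 99838, which is more than the 91658 lines in cfl25.subset.align.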

wc -l salm.subset.align
533912 salm.subset.align
wc -l salm.subset.names
272680 salm.subset.names

272680*2 = 545360, again more than the 533912 lines in salm.subset.align.
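The line-count comparison above relies on every fasta record taking exactly two lines. A more direct check is to count header lines and to list the name-file representatives that have no fasta record. The Python sketch below does that, assuming standard fasta headers that begin with `>`. The function names and toy data are hypothetical.

```python
def count_fasta_records(lines):
    """Count fasta records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_name_rows(lines):
    """Count non-blank rows in a name file (one row per unique sequence)."""
    return sum(1 for line in lines if line.strip())


def missing_representatives(fasta_lines, name_lines):
    """Return name-file representatives (column 1) absent from the fasta file."""
    fasta_ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                fasta_ids.add(header[0])  # id is the first token after '>'
    reps = {line.split()[0] for line in name_lines if line.strip()}
    return sorted(reps - fasta_ids)


# Toy data (hypothetical): the name file lists seq2, but the fasta file lacks it.
toy_fasta = [">seq1", "atgcatgc"]
toy_names = ["seq1\tseq1,seq3", "seq2\tseq2,seq5"]
print(count_fasta_records(toy_fasta), count_name_rows(toy_names))
print(missing_representatives(toy_fasta, toy_names))
```

If the two counts differ, the list from `missing_representatives` shows which unique sequences were dropped from the fasta file without the name file being updated.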

I suspect somewhere in the analysis before these files were created you forgot to include the names and groups files on a command that removed sequences. If you post the commands you used to get to this point I may be able to spot the mistake.

Kindly,
Sarah

westcott · February 20, 2012, 7:33pm

The problem stems from the get.groups command: you need to include the name file when you run it. Consider the following small example:

fasta file:

>seq1
atgcatgc
>seq2
tagataga

name file
seq1 seq1,seq3,seq4
seq2 seq2,seq5,seq6

group file
seq1 A
seq2 B
seq3 A
seq4 B
seq5 A
seq6 B

If you say get.groups(groups=A, group=groupfile, fasta=fastafile), you will get:
fasta file:

>seq1
atgcatgc

group file
seq1 A
seq3 A
seq5 A

seq2 is eliminated because it is not from group A, and without the name file mothur does not know that it also represents seq5, which is from group A.

But, if you say get.groups(groups=A, group=groupfile, fasta=fastafile, name=namefile), you will get:

fasta file:

>seq1
atgcatgc
>seq5
tagataga

name file
seq1 seq1,seq3
seq5 seq5

group file
seq1 A
seq3 A
seq5 A
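To make the difference concrete, here is a small Python sketch that reproduces both outcomes on the example above. It only mirrors the behaviour described in this post and is not mothur's actual implementation. In particular, promoting the first surviving member to be the new representative is an assumption, and the function names are made up.

```python
def get_groups_without_names(fasta, groups, wanted):
    """Filter fasta and group data by group only; redundant names are unknown."""
    new_fasta = {n: s for n, s in fasta.items() if groups.get(n) in wanted}
    new_groups = {n: g for n, g in groups.items() if g in wanted}
    return new_fasta, new_groups


def get_groups_with_names(fasta, names, groups, wanted):
    """Filter by group while keeping fasta, name, and group data in sync."""
    new_fasta, new_names, new_groups = {}, {}, {}
    for rep, members in names.items():
        kept = [m for m in members if groups.get(m) in wanted]
        if not kept:
            continue  # no member of this unique sequence is in a wanted group
        # Assumption: if the representative was removed, promote the first
        # surviving member so the sequence is still represented.
        new_rep = rep if rep in kept else kept[0]
        new_fasta[new_rep] = fasta[rep]
        new_names[new_rep] = kept
        for m in kept:
            new_groups[m] = groups[m]
    return new_fasta, new_names, new_groups


# The example data from this post.
fasta = {"seq1": "atgcatgc", "seq2": "tagataga"}
names = {"seq1": ["seq1", "seq3", "seq4"], "seq2": ["seq2", "seq5", "seq6"]}
groups = {"seq1": "A", "seq2": "B", "seq3": "A",
          "seq4": "B", "seq5": "A", "seq6": "B"}

print(get_groups_without_names(fasta, groups, {"A"}))
print(get_groups_with_names(fasta, names, groups, {"A"}))
```

In the first call the fasta data ends up with one sequence while the group data still lists three names, which is the kind of mismatch that surfaces later as a name file and group file disagreement.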

I hope this helps,
