namefile and groupfile mismatch

Hi there,

I have been having issues with the name file and group file not matching after a set of merge, screen, filter, and unique commands in mothur 1.23. I’ve tried a couple of different tests to figure out where the missing sequences went, but I’m at a loss right now. Here are the commands I used and the error that comes up:

merge.files(input=cfl25.subset.align-salm.subset.align, output=salm.all.subset.align)
merge.files(input=cfl25.subset.groups-salm.subset.groups, output=salm.all.subset.groups)
merge.files(input=cfl25.subset.names-salm.subset.names, output=salm.all.subset.names)
screen.seqs(fasta=salm.all.subset.align, name=salm.all.subset.names, group=salm.all.subset.groups, optimize=start-end, criteria=99, processors=24)
filter.seqs(fasta=current, vertical=T, trump=.,processors=24)
unique.seqs(fasta=current, name=current)
pre.cluster(fasta=current, name=current, group=current, diffs=2)

mothur > pre.cluster(fasta=salm.all.subset.good.filter.unique.fasta, name=salm.all.subset.good.filter.names, group=salm.all.subset.good.groups, diffs=2)

[ERROR]: Your name file contains 763839 valid sequences, and your groupfile contains 1162922, please correct.

Somewhere around the filter step, a bunch of sequences go missing. This is a pretty large dataset combining multiple plates. Please let me know if there is additional information I can send you guys. Thanks very much for your help!

Best,
Emiley

Does mothur output any missing names? Could any of your sequences have the same name? If you send your files to mothur.bugs@gmail.com, I can try and track down the problem for you.

mothur only outputs a list of the missing sequences after the make.shared command (it would be nice if pre.cluster reported missing sequences the same way make.shared does…).

I don’t think that there are duplicate sequence names… at least not that I’ve been able to track down…
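For anyone hitting the same error, here is a minimal Python sketch (not part of mothur) of one way to list mismatched or duplicated names outside mothur. It assumes the usual layouts: each name-file line is a representative name, whitespace, then a comma-separated list of every name it represents; each group-file line is a sequence name, whitespace, then a group. The function names are made up for illustration.

```python
from collections import Counter


def names_in_namefile(lines):
    """Return every sequence name listed in the second column of a name file."""
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        names.extend(fields[1].split(","))
    return names


def names_in_groupfile(lines):
    """Return the sequence name (first column) of every group-file line."""
    return [line.split()[0] for line in lines if line.strip()]


def compare(name_lines, group_lines):
    """Report names found in only one file and names listed more than once."""
    in_names = names_in_namefile(name_lines)
    in_groups = names_in_groupfile(group_lines)
    return {
        "only_in_groups": sorted(set(in_groups) - set(in_names)),
        "only_in_names": sorted(set(in_names) - set(in_groups)),
        "duplicated_in_names": sorted(n for n, c in Counter(in_names).items() if c > 1),
        "duplicated_in_groups": sorted(n for n, c in Counter(in_groups).items() if c > 1),
    }


if __name__ == "__main__":
    # Toy data: seq4 has a group but is not listed in the name file.
    name_lines = ["seq1\tseq1,seq3", "seq2\tseq2"]
    group_lines = ["seq1\tA", "seq2\tB", "seq3\tA", "seq4\tB"]
    report = compare(name_lines, group_lines)
    print(report["only_in_groups"])  # ['seq4']
```

For real data, open file handles can be passed instead of lists, e.g. `compare(open("salm.all.subset.good.filter.names"), open("salm.all.subset.good.groups"))`.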

The files are really big. Is there another way for me to send them to you? And which files specifically do you need? Thanks so much for your help. I really really appreciate it!

Thanks,
Emiley

How did you create the original files?

Using wc and mothur’s count.seqs command, I can see that the names and groups files have the same number of sequences for both datasets.

wc -l salm.subset.groups
968999 salm.subset.groups
mothur > count.seqs(name=salm.subset.names)
Total number of sequences: 968999

wc -l cfl25.subset.groups
200824 cfl25.subset.groups
mothur > count.seqs(name=cfl25.subset.names)
Total number of sequences: 200824

But the fasta files are missing some unique sequences:

wc -l cfl25.subset.align
91658 cfl25.subset.align
wc -l cfl25.subset.names
49919 cfl25.subset.names

The fasta file should have at least twice as many lines as the name file, but 49919*2 = 99838, which is more than the 91658 lines in cfl25.subset.align.

wc -l salm.subset.align
533912 salm.subset.align
wc -l salm.subset.names
272680 salm.subset.names

272680*2 = 545360, which is again more than the 533912 lines in salm.subset.align.
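To put a number on the shortfall, here is a small Python sketch of the arithmetic. It assumes every sequence takes exactly two lines in the fasta file (one header line and one sequence line); if sequences wrap over several lines, the real shortfall is larger. The function name is made up for illustration.

```python
def missing_sequences(fasta_lines, name_lines):
    """Unique sequences absent from the fasta file, assuming 2 fasta lines per sequence."""
    expected_fasta_lines = 2 * name_lines
    return (expected_fasta_lines - fasta_lines) // 2


# (lines in the .align file, lines in the .names file), taken from the wc output above
counts = {
    "cfl25": (91658, 49919),
    "salm": (533912, 272680),
}

for dataset, (fasta_lines, name_lines) in counts.items():
    print(dataset, missing_sequences(fasta_lines, name_lines))
# cfl25 4090
# salm 5724
```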

I suspect somewhere in the analysis before these files were created you forgot to include the names and groups files on a command that removed sequences. If you post the commands you used to get to this point I may be able to spot the mistake.

Kindly,
Sarah

The problem stems from the get.groups command. You want to include the name file. Consider the following small example:

fasta file:

>seq1
atgcatgc
>seq2
tagataga

name file:
seq1 seq1,seq3,seq4
seq2 seq2,seq5,seq6

group file:
seq1 A
seq2 B
seq3 A
seq4 B
seq5 A
seq6 B

If you say get.groups(groups=A, group=groupfile, fasta=fastafile), you will get:
fasta file:

>seq1
atgcatgc

group file:
seq1 A
seq3 A
seq5 A

seq2 is eliminated because it is not from group A, and without the name file mothur cannot know that it also represents seq5, which is from group A.

But, if you say get.groups(groups=A, group=groupfile, fasta=fastafile, name=namefile), you will get:

fasta file:

>seq1
atgcatgc
>seq5
tagataga

name file:
seq1 seq1,seq3
seq5 seq5

group file:
seq1 A
seq3 A
seq5 A
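The bookkeeping in the example above can be sketched in a few lines of Python. This is only an illustration of the logic, not mothur's actual code; the function name is hypothetical, and the rule of promoting the first surviving name to representative is an assumption that happens to match the output shown above.

```python
def get_group(fasta, names, groups, wanted):
    """Keep only sequences from `wanted`, promoting a new representative when needed.

    fasta:  dict of representative name -> sequence
    names:  dict of representative name -> list of all names it represents
    groups: dict of sequence name -> group
    """
    new_fasta, new_names, new_groups = {}, {}, {}
    for rep, members in names.items():
        kept = [m for m in members if groups[m] == wanted]
        if not kept:
            continue  # nothing represented by this sequence belongs to the group
        # Keep the old representative if it survives, otherwise promote the first survivor.
        new_rep = rep if rep in kept else kept[0]
        new_fasta[new_rep] = fasta[rep]
        new_names[new_rep] = kept
        for m in kept:
            new_groups[m] = wanted
    return new_fasta, new_names, new_groups


fasta = {"seq1": "atgcatgc", "seq2": "tagataga"}
names = {"seq1": ["seq1", "seq3", "seq4"], "seq2": ["seq2", "seq5", "seq6"]}
groups = {"seq1": "A", "seq2": "B", "seq3": "A", "seq4": "B", "seq5": "A", "seq6": "B"}

new_fasta, new_names, new_groups = get_group(fasta, names, groups, "A")
print(new_fasta)   # {'seq1': 'atgcatgc', 'seq5': 'tagataga'}
print(new_names)   # {'seq1': ['seq1', 'seq3'], 'seq5': ['seq5']}
print(new_groups)  # {'seq1': 'A', 'seq3': 'A', 'seq5': 'A'}
```

Without the `names` mapping there is no way to tell that the seq2 sequence also stands in for seq5, which is why the second fasta record is lost when the name file is left out.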

I hope this helps :)