I have been having issues with the namefile and groupfile not matching after a set of merge, screen, filter, and unique commands using mothur-1.23 - I’ve tried a couple different tests to figure out where the missing sequences went, but I’m at a loss right now… Here are the commands that I’ve used and the error that comes up:
screen.seqs(fasta=salm.all.subset.align, name=salm.all.subset.names, group=salm.all.subset.groups, optimize=start-end, criteria=99, processors=24)
filter.seqs(fasta=current, vertical=T, trump=.,processors=24)
pre.cluster(fasta=current, name=current, group=current, diffs=2)
mothur > pre.cluster(fasta=salm.all.subset.good.filter.unique.fasta, name=salm.all.subset.good.filter.names, group=salm.all.subset.good.groups, diffs=2)
[ERROR]: Your name file contains 763839 valid sequences, and your groupfile contains 1162922, please correct.
Somewhere around the filter step, a bunch of sequences go missing. This is a pretty large dataset combining multiple plates. Please let me know if there is additional information I can send you guys. Thanks very much for your help!
Does mothur output any missing names? Could any of your sequences have the same name? If you send your files to firstname.lastname@example.org, I can try and track down the problem for you.
It’s only after the make.shared command that it outputs a list of the missing sequences (it would be a nice function to have the same output of missing sequences in the pre.cluster command as the make.shared command…).
I don’t think that there are duplicate sequence names… at least not that I’ve been able to track down…
The files are really big - is there another way for me to send you the files? And which files specifically do you need - thanks so much for your help. I really really appreciate it!
How did you create the original files?
Using wc and mothur’s count.seqs command, I can see that the names and groups files have the same number of sequences for both datasets.
wc -l salm.subset.groups
mothur > count.seqs(name=salm.subset.names)
Total number of sequences: 968999
wc -l cfl25.subset.groups
mothur > count.seqs(name=cfl25.subset.names)
Total number of sequences: 200824
But the fasta files are missing some unique sequences:
wc -l cfl25.subset.align
wc -l cfl25.subset.names
The fasta file should have at least twice the number of lines as the name file, but 49919*2 = 99838.
wc -l salm.subset.align
wc -l salm.subset.names
272680*2 = 545360.
I suspect somewhere in the analysis before these files were created you forgot to include the names and groups files on a command that removed sequences. If you post the commands you used to get to this point I may be able to spot the mistake.
The problem stems from the get.groups command. You want to include the name file. Consider the following small example:
If you say get.groups(groups=A, group=groupfile, fasta=fastafile), you will get:
Sequence 2 is eliminated because it is not from groupA and we don’t know that it represents a sequence from group A.
But, if you say get.groups(groups=A, group=groupfile, fasta=fastafile, name=namefile), you will get:
I hope this helps,