Get.seqs returning different numbers


I tried running list.seqs followed by get.seqs because my groups file contains more sequences than my list file, which is stopping me from moving on to make.shared. I ran the following commands:

mothur > list.seqs(

Output File Name:

mothur > get.seqs(,,,

Selected 33322 sequences from your name file.
Selected 33322 sequences from your group file.
Selected 27124 sequences from your list file.

Output File Names:

And as you can see, my groups file is still larger than my list file. Has anyone had this problem before? Do you have any suggestions for how I can fix it?

The get.seqs command has a dups parameter that defaults to true. dups=t will also select all the redundant names from the names file for the seqs listed in the accnos file. Rather than removing names from the names and group file, I would like to help you figure out how the discrepancy happened. Did you forget to include the names file on the cluster command?

Thanks! I used dups=f and it seemed to solve the problem.
I have been using cluster without the names option:

dist.seqs(, cutoff=0.15, processors=2, output=square)

I have been using these commands for a few months and only started having problems when I updated to version 1.28. I can start adding the names file to the cluster command, but would that help with the problems with the groups file too?

It would resolve the problems you are having with the groups file. It is also important to include the names file when you cluster because mothur uses the number of sequences in each OTU while clustering with the average neighbor method. Here’s an example:

seq2 0.01
seq3 0.015 0.02
seq4 0.03 0.04 0.04
seq5 0.05 0.06 0.07 0.017

seq1 represents 10 seqs in names file.
seq2 represents 1 seq in names file.
seq3 represents 5 seqs in names file.
seq4 represents 2 seqs in names file.
seq5 represents 30 seqs in names file.

The first cluster mothur will make is seq1 and seq2, since 0.01 is the smallest distance. With average neighbor, the distance from the merged cluster to each remaining sequence is newDist = (numSeqs1 * dist1 + numSeqs2 * dist2) / (numSeqs1 + numSeqs2). The new distances are:

new12,3Dist = (10 * 0.015 + 1 * 0.02) / 11 = 0.0154; without a names file it would be (1 * 0.015 + 1 * 0.02) / 2 = 0.0175.
new12,4Dist = (10 * 0.03 + 1 * 0.04) / 11 = 0.0309; without a names file it would be (1 * 0.03 + 1 * 0.04) / 2 = 0.035.
new12,5Dist = (10 * 0.05 + 1 * 0.06) / 11 = 0.0509; without a names file it would be (1 * 0.05 + 1 * 0.06) / 2 = 0.055.
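
The arithmetic above can be checked with a short Python sketch. This is only an illustration of the weighted-average formula from this post, not mothur's actual code; the function name merged_distance and the dictionary layout are made up for the example. Note that Python rounds 0.01545... to 0.0155, while the figures in this post are truncated to 0.0154.

```python
def merged_distance(n1, d1, n2, d2):
    """Size-weighted average of two distances (the newDist formula in this post)."""
    return (n1 * d1 + n2 * d2) / (n1 + n2)

# Distances from seq1 and seq2 to each remaining sequence, taken from the example matrix.
dist_to_seq1 = {"seq3": 0.015, "seq4": 0.03, "seq5": 0.05}
dist_to_seq2 = {"seq3": 0.02, "seq4": 0.04, "seq5": 0.06}

# With a names file: seq1 represents 10 sequences and seq2 represents 1.
with_names = {
    s: merged_distance(10, dist_to_seq1[s], 1, dist_to_seq2[s]) for s in dist_to_seq1
}

# Without a names file: each row is treated as a single sequence.
without_names = {
    s: merged_distance(1, dist_to_seq1[s], 1, dist_to_seq2[s]) for s in dist_to_seq1
}

for s in dist_to_seq1:
    print(s, round(with_names[s], 4), round(without_names[s], 4))
```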

with names file:
seq3 0.0154
seq4 0.0309 0.04
seq5 0.0509 0.07 0.017

seq1,seq2 seq3 seq4 seq5

without names file:

seq3 0.0175
seq4 0.035 0.04
seq5 0.055 0.07 0.017

seq1,seq2 seq3 seq4 seq5

Now we can see how the names file will affect the clustering. With it, the next smallest distance is 0.0154, which merges seq1,seq2 with seq3. Without it, the next smallest distance is 0.017, which merges seq4 with seq5.

So with the names file you end up with seq1,seq2,seq3 seq4 seq5 in the list file instead of seq1,seq2 seq3 seq4,seq5.
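
To see why the second merge differs, here is a rough Python sketch that picks the smallest remaining distance in each merged matrix. The values are copied from the two matrices in this post; the closest_pair helper and the dictionary keys are just for illustration and are not how mothur stores its data.

```python
def closest_pair(dist):
    """Return the pair of clusters with the smallest distance between them."""
    return min(dist, key=dist.get)

# Distances that do not involve the merged seq1,seq2 cluster are the same in both cases.
unchanged = {
    ("seq3", "seq4"): 0.04,
    ("seq3", "seq5"): 0.07,
    ("seq4", "seq5"): 0.017,
}

# Merged matrix with a names file (distances weighted by sequence counts).
with_names = dict(unchanged)
with_names.update({
    ("seq1,seq2", "seq3"): 0.0154,
    ("seq1,seq2", "seq4"): 0.0309,
    ("seq1,seq2", "seq5"): 0.0509,
})

# Merged matrix without a names file (every row counted once).
without_names = dict(unchanged)
without_names.update({
    ("seq1,seq2", "seq3"): 0.0175,
    ("seq1,seq2", "seq4"): 0.035,
    ("seq1,seq2", "seq5"): 0.055,
})

print("with names file:   ", closest_pair(with_names))
print("without names file:", closest_pair(without_names))
```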