split.groups gives more sequences than are in the original

Hi!

I am using mothur v.1.22.2.

I split my file like this:

mothur > split.groups(fasta=GDEG1CX02.shhh.trim.unique.pick.chop.fasta, group=GDEG1CX02.shhh.pick.groups, name=GDEG1CX02.shhh.trim.pick.names)

This file contains

% grep -c ">" GDEG1CX02.shhh.trim.unique.pick.chop.fasta
850

sequences.
Now, when I count the number of sequences in the resulting files I get
% grep -c ">" GDEG1CX02.shhh.trim.unique.pick.chop.R*.fasta | awk -F":" '{SUM += $2} END {print SUM}'
912

sequences.

What happens here?


Karin

That seems odd. Could you send your files to mothur.bugs@gmail.com?

Just sent them now.

Thanks!

Hi Karin,

The total across all groups can be higher because of the name file. Let's look at an example:

From the names file:
seq1 seq1,seq2,seq3

From the fasta file:

>seq1
ATGCATGA…

From the group file:
seq1 Group1
seq2 Group2
seq3 Group1

When mothur splits by group, it will create a new names and fasta file for each group.

*.Group1.fasta

>seq1
ATGCATGA…

*.Group1.names
seq1 seq1,seq3

*.Group2.fasta

>seq2
ATGCATGA…

*.Group2.names
seq2 seq2

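The one unique sequence represents sequences from multiple groups, so each group gets its own copy, and the per-group fasta files can add up to more sequences than the original unique fasta file. Does that make sense?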
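
If it helps, here is a minimal Python sketch of the same idea using the toy data from the example. It is not mothur's actual code. The function name split_by_group is made up for illustration, and the rule for picking the new representative in each group (keep the original one if it belongs to the group, otherwise take the first member) is an assumption that simply matches the example.

```python
from collections import defaultdict

def split_by_group(names, groups):
    """Split a names mapping by group (illustrative only, not mothur's code).

    names:  representative -> list of all sequence names it stands for
    groups: sequence name -> group
    Returns group -> {new representative: [members in that group]}.
    """
    per_group = defaultdict(dict)
    for rep, members in names.items():
        # Bucket the duplicates of this unique sequence by their group.
        by_group = defaultdict(list)
        for seq in members:
            by_group[groups[seq]].append(seq)
        for group, seqs in by_group.items():
            # Assumption: keep the original representative when it is in
            # this group, otherwise use the first member found.
            new_rep = rep if rep in seqs else seqs[0]
            per_group[group][new_rep] = seqs
    return dict(per_group)

# Toy data from the example above.
names = {"seq1": ["seq1", "seq2", "seq3"]}
groups = {"seq1": "Group1", "seq2": "Group2", "seq3": "Group1"}

result = split_by_group(names, groups)
unique_before = len(names)
unique_after = sum(len(reps) for reps in result.values())
print(unique_before, unique_after)  # prints: 1 2
```

One unique sequence in the input becomes two fasta records after the split, one per group that it occurs in. Applied to real files, every unique sequence is counted once for each group it appears in, which is how a total like 912 can come from 850 unique sequences.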

Kindly,
Sarah