get.seqs() gets it wrong?

Hi,

I have question regarding my output from get.seqs().

Firstly, I modified my groups file (removing seqs) which I then used in the cmd list.seqs() to get a accnos file.

Then, I ran get.seqs() with that accnos file and my list file. Another time with the accnos file and my fasta file and another time with my names file.

My fasta and names file which were the input for the get.seqs() both contain the same amount of sequences (wc -l fasta = 121932, names = 60966), but comparing the output (.pick.names & .pick.fasta) from the get.seqs() cmd, they don’t match up anymore (wc -l .pick.fasta = 118094, .pick.names = 59222). The new .pick.names file have 175 sequences (lines) more than the .pick.fasta has sequences.

Also, when running summary.seqs using the .pick.fasta and .pick.names files the #unique seqs are the same as the .pick.fasta (121932/2 = 59047), but the total #seqs says 711851 which is more than the .accnos and modified groups file contains (wc -l = 702490). Using the get.seqs() on my list file creating a .pick.list also says that 702490 sequences was selected.

So, basically I don’t know why the pick.names file from the get.seqs() cmd contain more sequences (lines) than the other pick.files?

Thanks,
J

Could you send your input files and logfile to mothur.bugs@gmail.com?

Sure thing! Done!

Here’s the answer I received from Sarah. Thanks! :slight_smile:

Hi Johannes,
The get.seqs command has a parameter called dups, default=t, meaning if you to add any name from a line in the names file add the entire line. If you want to select only the names in the accnos file, set dups=f.
Kindly,
Sarah Westcott

Thanks for the reply.

Using dups=f when running get.seqs() on my names file seems to solve the problem partly.

get.seqs(accnos=new.final.accnos, name=final.names,dups=f)

Selected 702490 sequences from your name file.

Output File Names: 
final.pick.names

…Which is the right amount of sequences (same as .accnos file). Before, using the default dups=t, it selected 713534 sequences (see original comment).

Although, counting lines in the final.pick.names, it still counts to 59222 which is the same number as before and do not match the final.pick.fasta file (wc -l = 118094/2 = 59047).

I still don’t get why this is, since the input fasta and names files match in their number of lines (wc -l fasta = 121932/2 =60966; names = 60966).
I see on the get.seqs() page that this bug occurred in the previous version. Maybe it’s still alive?

Thanks,
J

When you are selecting names from a names and fasta file you must run them in the same command, because if the “unique” name is not selected mothur will need to change the name in the fasta file as well. Here’s an example:

fasta file:

seq1
atgcatgcatgc
seq2
tgactgactgac

names file:
seq1 seq1,seq3,seq5,seq7
seq2 seq2,seq4

accnos:
seq3
seq2

get.seqs(fasta=yourFastaFile, name=yourNameFile, accnos=yourAccnos, dups=f)

fasta file:

seq3
atgcatgcatgc
seq2
tgactgactgac

names file:
seq3 seq3
seq2 seq2

get.seqs(name=yourNameFile, accnos=yourAccnos, dups=f) - creates a file mismatch because the fasta file doesn’t know seq1 and seq3 are identical.

fasta file:

seq1
atgcatgcatgc
seq2
tgactgactgac

names file:
seq3 seq3
seq2 seq2

Thanks for clearing things up! Although, follow-up question below.

So I ran…

mothur > get.seqs(accnos=new.final.accnos, list=final.phylip.fn.list, fasta=final.fasta, name=final.names, dups=f)

Selected 702490 sequences from your name file.
Selected 59222 sequences from your fasta file.
Selected 702490 sequences from your list file.

Output File Names: 
final.pick.names
final.pick.fasta
final.phylip.fn.pick.list

mothur > system(wc -l final.pick.names)

59222 final.pick.names

mothur > system(wc -l final.pick.fasta)

118444 final.pick.fasta

…Which seems to be correct.

However, when running summary.seqs with the new .pick.fasta and .pick.names file I got a bit of a confusing output.

mothur > summary.seqs(fasta=final.pick.fasta, name=final.pick.names, processors=5)


Using 5 processors.
[ERROR]: 'HX1JDSX01EHJLW' is not in your name or count file, please correct.
[ERROR]: 'HX1JDSX01DG9F8' is not in your name or count file, please correct.
[ERROR]: 'HXXS1YU02JAP0C' is not in your name or count file, please correct.

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 933 210 0 3 1
2.5%-tile: 1 935 225 0 3 17000
25%-tile: 1 935 241 0 4 169995
Median:  1 935 256 0 5 339990
75%-tile: 1 935 266 0 5 509985
97.5%-tile: 1 935 276 0 6 662980
Maximum: 1 935 302 0 8 679979
Mean: 1 935 254.157 0 4.71496
# of unique seqs: 23692
total # of seqs: 679979

Output File Names: 
final.pick.summary

I don’t know what’s going on here. Why doesn’t this output agree with the previous? Why are there 3 mismatches here, in addition to the total/unique count being totally off?

Many thanks,
J

Thanks for helping us find this issue. Mothur was not properly renaming the sequences in the fasta file to reflect the changes to the names file. I will release a 1.31.1 version later today that resolves this issue. Remove.seqs handles the renaming properly, so if you would like you could use remove.seqs to create the files until the release is ready.

Happy to be of any help! Thanks for fixing it and providing an awesome service!