I am a mothur newbie and a bit confused about my results.
Before analysing my processed data, I wanted to separate it into 22 samples. I therefore split my groups file into 22 new groups files and ran list.seqs on each one to get the corresponding accnos files.
I then ran, for each sample:
mothur > get.seqs(accnos=A1_V.accnos, fasta=AmpIfinal.fasta)
Selected 533 sequences from your fasta file.
Output File Names:
AmpIfinal.pick.fasta
The total number of sequences selected for each sample added up to the total number of unique sequences.
Then I ran:
mothur > get.seqs(accnos=A1_V.accnos, name=AmpIfinal.names, fasta=AmpIfinal.fasta, group=A1_V.groups, dups=F)
Selected 2289 sequences from your name file.
Selected 611 sequences from your fasta file.
Selected 2289 sequences from your group file.
The numbers of sequences selected from my names and groups files add up to the total number of sequences, but the numbers of sequences selected from the fasta file add up to more than the number of unique sequences.
Can anyone please explain this discrepancy?
Should I keep the 22 new names files and groups files and run get.seqs for the fasta files separately or will I run into problems later if get.seqs is not run simultaneously for the names, fasta and groups files?
Welcome to the mothur community! What version of mothur are you using? The list.seqs and get.seqs should do what you are trying to do, but it may be easier to do with the split.groups command, http://www.mothur.org/wiki/Split.groups.
For the get.seqs command, dups=t by default. This means that if a unique sequence is selected, then all the redundant sequences for that sequence are selected as well. From looking at mothur’s outputs, I suspect you have some sequences in the fasta file that are listed in column 2 of the names file. Mothur assumes a “unique” fasta file contains only sequences from column 1 of the names file. How did you create these fasta and names files?
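One way to test that suspicion outside mothur is to compare the fasta headers with the two columns of the names file. The following is only a minimal Python sketch, not part of mothur: the function names and the `seqA`/`seqB` example data are made up for illustration, and it assumes each names-file line holds a representative name, whitespace, and a comma-separated list of names.

```python
def read_names(lines):
    """Split names-file lines into representatives and redundant-only names.

    Assumed line layout: representative <whitespace> name1,name2,...
    """
    reps, redundant = set(), set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rep, members = parts[0], parts[1].split(",")
        reps.add(rep)
        redundant.update(m for m in members if m != rep)
    # A name counts as redundant only if it never appears in column 1.
    return reps, redundant - reps


def fasta_ids(lines):
    """Collect sequence names from FASTA header lines (first word after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].split():
            ids.add(line[1:].split()[0])
    return ids


def non_unique_in_fasta(fasta_lines, names_lines):
    """Names present in the fasta file that appear only in column 2."""
    _, redundant = read_names(names_lines)
    return sorted(fasta_ids(fasta_lines) & redundant)


# Made-up example: seqB is a redundant name that should not be in a unique fasta.
example_names = ["seqA\tseqA,seqB,seqC", "seqD\tseqD"]
example_fasta = [">seqA", "ACGT", ">seqB", "ACGT", ">seqD", "GGCC"]
print(non_unique_in_fasta(example_fasta, example_names))
```

For real files, the lines could be passed in with something like `open("AmpIfinal.fasta")` and `open("AmpIfinal.names")`. An empty result would mean the fasta file contains only names from column 1.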
The fasta and names files are outputs of sequence processing according to the 454 SOP.
What bothers me is that
mothur > get.seqs(accnos=A1_V.accnos, fasta=AmpIfinal.fasta) results in 533 sequences while
mothur > get.seqs(accnos=A1_V.accnos, name=AmpIfinal.names, fasta=AmpIfinal.fasta, group=A1_V.groups, dups=F) results in 611.
Also, the latter command results in 533 sequences in v.1.24 and 611 sequences in v.1.32.
But I am a newbie, so maybe I should just switch to v.1.24?
I would not recommend switching to version 1.24. Our latest version contains many new features, updates and bug fixes. If you send your files to mothur.bugs@gmail.com I can track down the exact cause of the discrepancy and help you resolve the issue.
The accnos file contains 2289 sequence names. Some are unique and some are redundant, because the accnos file was created from a groups file. It can get confusing this way. Ideally you want the accnos file to contain only the unique names, because mothur is smart enough to handle the names file. Let me give you an example of what’s happening with one of the 78 additional sequences (611 - 533) that are selected when the names file is supplied.
From the names file:
H53OP4K01ALX9M H53OP4K01ALX9M,H53OP4K01ALBPZ,H53OP4K01ANM0S,H53OP4K01AI82C,H53OP4K01AHG1T,H53OP4K01ARODS,H53OP4K01ALANE,H53OP4K03B0Y69,H53OP4K03BUA4S
H53OP4K03B0Y69 is in the accnos file. When you run get.seqs with just the fasta file, mothur does not select sequence H53OP4K01ALX9M because it does not know that H53OP4K01ALX9M represents H53OP4K03B0Y69. But when you run get.seqs with the names file mothur does select H53OP4K01ALX9M because it makes the connection between H53OP4K01ALX9M and H53OP4K03B0Y69.
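The difference between the two runs can be mimicked with a few lines of Python. This is a rough sketch of the behaviour described above, not mothur's actual code: the function names are invented, and the fasta and accnos contents are reduced to the single sequence from the example.

```python
# The names-file line quoted above: representative, then all names it stands for.
NAMES_LINE = (
    "H53OP4K01ALX9M "
    "H53OP4K01ALX9M,H53OP4K01ALBPZ,H53OP4K01ANM0S,H53OP4K01AI82C,"
    "H53OP4K01AHG1T,H53OP4K01ARODS,H53OP4K01ALANE,H53OP4K03B0Y69,"
    "H53OP4K03BUA4S"
)


def parse_names(lines):
    """Map each representative name to the list of names it represents."""
    names = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            names[parts[0]] = parts[1].split(",")
    return names


def select_fasta_only(fasta_names, accnos):
    """Without a names file, only exact name matches can be selected."""
    return {name for name in fasta_names if name in accnos}


def select_with_names(names, accnos):
    """With a names file, a representative is selected when any of the
    names it stands for is listed in the accnos file."""
    return {
        rep
        for rep, members in names.items()
        if any(member in accnos for member in members)
    }


names = parse_names([NAMES_LINE])
fasta_names = {"H53OP4K01ALX9M"}   # the unique fasta holds only the representative
accnos = {"H53OP4K03B0Y69"}        # the accnos file lists a redundant name

print(select_fasta_only(fasta_names, accnos))  # nothing matches by name alone
print(select_with_names(names, accnos))        # the representative is picked up
```

The set returned by `select_with_names` is also what an accnos file restricted to unique names would contain for this sample, so writing it out, one name per line, would be one way to build such a file.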