get.lineage()....missing names?

verenastarke · November 18, 2010, 5:02pm

Hi there,

I would like to select a subset of sequences from my data set. However, I am having problems with the name-files. I am not able to get the right set of sequences (number-wise)…either they are missing or there are too many. HELP!

I am not sure what I am doing wrong or how I could do it differently. I would be grateful for any help or input.

Details for what I did are below.

Cheers,
V

DETAILS:
starting files
VS.final.fasta: 18418 unique sequences
VS.final.names: 95342 sequences
VS.final.groups: 95342 sequences

of the 95342 sequences:
37763 sequences belong to cyanobacteria with 3200 unique sequences
of the 37763 sequences:
15110 sequences belong to a subgroup of cyanobacteria with 1498 unique sequences, which I am interested in.

I used get.lineage to select for cyanobacteria:

mothur > get.lineage(taxonomy=VS.final.rdp6.taxonomy, taxon=Bacteria;Cyanobacteria;, group=VS.final.groups, name=VS.final.names, fasta=VS.final.fasta)

the newly generated files were renamed to “VS.cyano.*”.
VS.cyano.fasta: 3200 unique sequences
VS.cyano.names: 3200 sequences
VS.cyano.groups: 3200 sequences
Shouldn’t the group and name file contain 37763 sequences?
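One way to check counts like these is to tally a names file directly. The sketch below assumes the usual mothur names-file layout: one line per unique sequence, with the representative name, whitespace, then a comma-separated list of every sequence it stands for. The function name `count_names` and the toy sequence names are made up for illustration and are not taken from the files above.

```python
import io

def count_names(handle):
    """Return (unique, total) for a mothur-style names file.

    unique = number of lines (representative sequences)
    total  = number of names across all comma-separated member lists
    """
    unique = 0
    total = 0
    for line in handle:
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        members = fields[1].strip().split(",")
        unique += 1
        total += len(members)
    return unique, total

# Toy data with hypothetical names: 2 unique sequences standing for 4 reads.
example = io.StringIO(
    "seqA\tseqA,seqB,seqC\n"
    "seqD\tseqD\n"
)
print(count_names(example))
```

If both numbers come back equal (3200 and 3200 here), the duplicate names were dropped from the second column when the file was written.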
Next I tested the files by using classify.seqs

mothur > classify.seqs(fasta=VS.cyano.fasta, group=VS.cyano.groups, name=VS.cyano.names, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)

The VS.cyano.rdp6.tax.summary-file only contained classifications for 3200 cyanobacteria sequences and not for 37763 sequences.
So I re-ran the classification with the original name-file and group-file (which contain the complete set of 95342 sequences):

mothur > classify.seqs(fasta=VS.cyano.fasta, group=VS.final.groups, name=VS.final.names, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)

The VS.cyano.rdp6.tax.summary-file contained classifications for 37763 sequences.
Then I removed a lineage from the cyanobacteria-set:

mothur > remove.lineage(fasta=VS.cyano.fasta, taxonomy=VS.cyano_classif2.rdp6.taxonomy, taxon=Bacteria;Cyanobacteria;Cyanobacteria;Chloroplast;, group=VS.cyano.groups)

the newly generated files were renamed to “VS.cyano_noChloro.*”.
VS.cyano_noChloro.fasta: 1498 unique sequences
VS.cyano_noChloro.groups: 1498 sequences
If I run the command with the name option, the name-file also contains only 1498 sequences.
I was only able to get 15110 classified sequences when I used the original name-file and group-file (which contain the complete set of 95342 sequences, as above).

I didn’t use the name-file generated by get.lineage for further analysis because it didn’t have all the names in it. So I tried to analyze the newly generated fasta file with the name-file for the complete sequence set:

mothur > dist.seqs(fasta=VS.cyano_noChloro.fasta, output=lt)

It took 8 to calculate the distances for 1498 sequences.

mothur > read.dist(cutoff=0.20, name=VS.final.names, phylip=VS.cyano_noChloro.phylip.dist)

mothur > cluster(method=average)

unique 12093 13766 1838 773 431 251 197 …

Why does it cluster all sequences and not just my selected ones (the ones in the matrix)?
The generated list-file contains the complete data set (95342 sequences), not just the subset. I checked the “VS.cyano_noChloro.fasta” file and it contains only 1498 sequences.

It seems that this is not possible without the right name-file.
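For illustration, a names file that matches a subset fasta can be sketched by keeping only the lines of the full names file whose representative appears in the fasta. This is a minimal sketch and not a mothur feature. It assumes the representative in the first column of the names file is the same name used in the fasta header. The helper names `fasta_ids` and `filter_names` and the toy data are hypothetical.

```python
import io

def fasta_ids(handle):
    """Collect sequence names from FASTA header lines ('>name ...')."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids

def filter_names(names_handle, keep):
    """Yield names-file lines whose representative (first column) is in keep."""
    for line in names_handle:
        fields = line.split()
        if fields and fields[0] in keep:
            yield line.rstrip("\n")

# Toy data with hypothetical names: the fasta holds 2 of 3 representatives.
fasta = io.StringIO(">seqA\nACGT\n>seqD\nACGA\n")
names = io.StringIO("seqA\tseqA,seqB\nseqD\tseqD\nseqE\tseqE,seqF\n")

kept = list(filter_names(names, fasta_ids(fasta)))
print(kept)
```

Each kept line still carries its full list of duplicates, so the total sequence count of the subset is preserved while unrelated sequences are left out.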

I tried to continue, but it wasn’t possible.

mothur > read.otu(group=VS.cyano_noChloro.groups, label=unique-0.03-0.05-0.10-0.20, list=VS.cyano_noChloro.phylip.an.list)

Your group file contains 1498 sequences and list file contains 95342 sequences. Please correct.
For a list of names that are in your list file and not in your group file, please refer to VS.cyano_noChloro.phylip2.an.missing.group.

…

Then I tried to run it without the name-option:

mothur > read.dist(cutoff=0.20, phylip=VS.cyano_noChloro.phylip.dist)

mothur > cluster(method=average)

unique 1 1498
0.01 11 1092 152…

mothur > read.otu(group=VS.cyano_noChloro.groups, label=unique-0.03-0.05-0.10-0.20, list=VS.cyano_noChloro.phylip.an.list)

The VS.cyano_noChloro.phylip.an.shared-file only contains 1498 sequences and not 15110.

Any ideas on what’s going on?

westcott · November 19, 2010, 11:29am

The problems you are having from the dist.seqs command onward stem from the names and group file problems. For the get.lineage and remove.lineage commands, the dups parameter defaults to false. Try running both with dups=T.
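To make the effect of dups concrete, here is a rough Python model of the selection step. It is not mothur's code, only a sketch of the behavior as I understand it: with dups on, a names-file line is kept whole as soon as any of its members is selected; with dups off, only the selected names survive. The function `select_names` and the toy names are hypothetical.

```python
def select_names(names_lines, selected, dups):
    """Model a lineage selection applied to names-file entries.

    names_lines: list of (representative, [all member names]) pairs.
    selected:    set of names whose taxonomy matched the taxon.
    dups:        True keeps every member of a line with at least one hit;
                 False keeps only the members that were selected.
    """
    out = []
    for rep, members in names_lines:
        hits = [m for m in members if m in selected]
        if not hits:
            continue  # no member of this line matched the taxon
        kept = list(members) if dups else hits
        new_rep = rep if rep in kept else kept[0]
        out.append((new_rep, kept))
    return out

# Toy data: the taxonomy file lists only the unique sequence seqA.
names = [("seqA", ["seqA", "seqB", "seqC"]), ("seqD", ["seqD", "seqE"])]
selected = {"seqA"}

without_dups = select_names(names, selected, dups=False)
with_dups = select_names(names, selected, dups=True)
print(without_dups)
print(with_dups)
```

In this model the dups=False result has one name per kept line, which matches the symptom in the question (3200 names for 3200 unique sequences), while dups=True carries the duplicates along.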

verenastarke · November 19, 2010, 5:04pm

Hi there,

Thanks, that helped! I appreciate it.
I got confused because the manual says that “By default dups is true”.

Cheers,
V
