get.lineage()....missing names?

Hi there,

I would like to select a subset of sequences from my data set. However, I am having problems with the name-files: I cannot get the right number of sequences in the subset. Either some are missing or there are too many. HELP!

I am not sure what I am doing wrong or how I could do it differently. I would be grateful for any help or input.

Details for what I did are below.

Cheers,
V

DETAILS:
starting files
VS.final.fasta: 18418 unique sequences
VS.final.names: 95342 sequences
VS.final.groups: 95342 sequences

  • of the 95342 sequences:
    37763 sequences belong to cyanobacteria with 3200 unique sequences
  • of the 37763 sequences:
    15110 sequences belong to a subgroup of cyanobacteria with 1498 unique sequences, which I am interested in.
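
For anyone checking counts like these: a names file has one line per unique sequence, with the representative name in the first column and a comma-separated list of every sequence it represents in the second. The number of lines gives the unique count and the number of listed names gives the total. A minimal Python sketch, assuming that two-column, whitespace-separated layout; `count_names` is a hypothetical helper, not part of mothur:

```python
import io

def count_names(handle):
    """Return (unique, total) for a mothur-style names file."""
    unique = 0
    total = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Column 1: representative name.
        # Column 2: comma-separated names that the representative stands for.
        representative, members = line.split()[:2]
        unique += 1
        total += len(members.split(","))
    return unique, total

# Made-up example data: two unique sequences standing for four reads in total.
example = io.StringIO("seqA\tseqA,seqB,seqC\nseqD\tseqD\n")
print(count_names(example))  # (2, 4)
```

On a real file this would be called as `count_names(open("VS.final.names"))`.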

I used get.lineage to select for cyanobacteria:

mothur > get.lineage(taxonomy=VS.final.rdp6.taxonomy, taxon=Bacteria;Cyanobacteria;, group=VS.final.groups, name=VS.final.names, fasta=VS.final.fasta)

The newly generated files were renamed to “VS.cyano.*”.
VS.cyano.fasta: 3200 unique sequences
VS.cyano.names: 3200 sequences
VS.cyano.groups: 3200 sequences
Shouldn’t the group and name file contain 37763 sequences?
Next I tested the files by using classify.seqs

mothur > classify.seqs(fasta=VS.cyano.fasta, group=VS.cyano.groups, name=VS.cyano.names, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)

The VS.cyano.rdp6.tax.summary-file only contained classifications for 3200 cyanobacteria sequences and not for 37763 sequences.
So I re-ran the classification with the original name-file and group-file (which contain the complete set of 95342 sequences):

mothur > classify.seqs(fasta=VS.cyano.fasta, group=VS.final.groups, name=VS.final.names, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)

The VS.cyano.rdp6.tax.summary-file contained classifications for 37763 sequences.
Then I removed a lineage from the cyanobacteria-set:

mothur > remove.lineage(fasta=VS.cyano.fasta, taxonomy=VS.cyano_classif2.rdp6.taxonomy, taxon=Bacteria;Cyanobacteria;Cyanobacteria;Chloroplast;, group=VS.cyano.groups)

The newly generated files were renamed to “VS.cyano_noChloro.*”.
VS.cyano_noChloro.fasta: 1498 unique sequences
VS.cyano_noChloro.groups: 1498 sequences
If I run the command with the name option, the name-file also contains only 1498 sequences.
I was only able to get 15110 classified sequences when I used the original name-file and group-file (which contain the complete set of 95342 sequences, as above).

I didn’t use the name-file generated by get.lineage for further analysis because it didn’t have all the names in it. Instead, I tried to analyze the newly generated fasta file with the name-file for the complete sequence set:

mothur > dist.seqs(fasta=VS.cyano_noChloro.fasta, output=lt)

It took 8 to calculate the distances for 1498 sequences.

mothur > read.dist(cutoff=0.20, name=VS.final.names, phylip=VS.cyano_noChloro.phylip.dist)

mothur > cluster(method=average)

unique 12093 13766 1838 773 431 251 197 …

Why does it cluster all sequences and not just my selected ones (the ones in the matrix)?
The generated list-file contains the complete data set (95342 sequences), not just the subset. I checked the “VS.cyano_noChloro.fasta” file, and it contains only 1498 sequences.

It seems that this is not possible without the right name-file.

I tried to continue, but it wasn’t possible.

mothur > read.otu(group=VS.cyano_noChloro.groups, label=unique-0.03-0.05-0.10-0.20, list=VS.cyano_noChloro.phylip.an.list)

Your group file contains 1498 sequences and list file contains 95342 sequences. Please correct.
For a list of names that are in your list file and not in your group file, please refer to VS.cyano_noChloro.phylip2.an.missing.group.

Then I tried to run it without the name-option:

mothur > read.dist(cutoff=0.20, phylip=VS.cyano_noChloro.phylip.dist)

mothur > cluster(method=average)

unique 1 1498
0.01 11 1092 152…

mothur > read.otu(group=VS.cyano_noChloro.groups, label=unique-0.03-0.05-0.10-0.20, list=VS.cyano_noChloro.phylip.an.list)

The VS.cyano_noChloro.phylip.an.shared-file only contains 1498 sequences and not 15110.

Any ideas on what’s going on?

The problems you are having from the dist.seqs command onward stem from the problems with the names and groups files. For the get.lineage and remove.lineage commands, the dups parameter defaults to false. Try running both with dups=T.
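
To make the difference concrete, here is a rough Python sketch of the idea behind dups. It is an illustration only, not mothur's actual implementation, and `select_names` is a hypothetical helper. It assumes the two-column names layout (representative name, then a comma-separated list of the names it represents). With dups on, a names line is kept whole as soon as any of its members is selected; with dups off, only the names that were themselves selected survive, which is why a selection made from unique sequences ends up with one name per line.

```python
def select_names(names_lines, wanted, dups=True):
    """Illustrative only: pick names-file entries that match a selection.

    names_lines: iterable of "representative<TAB>name1,name2,..." strings
    wanted:      set of sequence names picked by the lineage filter
    dups:        if True, keep every duplicate on a matching line
    """
    kept = []
    for line in names_lines:
        representative, members = line.split()[:2]
        members = members.split(",")
        hits = [m for m in members if m in wanted]
        if not hits:
            continue  # nothing on this line was selected
        # dups=True keeps the whole line; dups=False keeps only the hits.
        chosen = members if dups else hits
        kept.append((chosen[0], chosen))
    return kept

# Made-up example: only the representative seqA appears in the selection.
names = ["seqA\tseqA,seqB,seqC", "seqD\tseqD"]
print(select_names(names, {"seqA"}, dups=True))   # [('seqA', ['seqA', 'seqB', 'seqC'])]
print(select_names(names, {"seqA"}, dups=False))  # [('seqA', ['seqA'])]
```

In the second call the duplicates seqB and seqC are lost, which mirrors names and groups files that shrink to the number of unique sequences.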

Hi there,

Thanks, that helped! I appreciate it.
I got confused because the manual says that “By default dups is true”.

Cheers,
V