Hi there,
I would like to select a subset of sequences from my data set. However, I am having problems with the name-files. I am not able to get the right number of sequences: either some are missing or there are too many. HELP!
I am not sure what I am doing wrong or how I could do it differently. I would be grateful for any help or input.
Details for what I did are below.
Cheers,
V
DETAILS:
Starting files:
VS.final.fasta: 18418 unique sequences
VS.final.names: 95342 sequences
VS.final.groups: 95342 sequences
- of the 95342 sequences, 37763 belong to cyanobacteria (3200 unique sequences)
- of the 37763 sequences, 15110 belong to a subgroup of cyanobacteria that I am interested in (1498 unique sequences)
I used get.lineage to select for cyanobacteria:
mothur > get.lineage(taxonomy=VS.final.rdp6.taxonomy, taxon=Bacteria;Cyanobacteria;, group=VS.final.groups, name=VS.final.names, fasta=VS.final.fasta)
the newly generated files were renamed to “VS.cyano.*”.
VS.cyano.fasta: 3200 unique sequences
VS.cyano.names: 3200 sequences
VS.cyano.groups: 3200 sequences
Shouldn’t the group and name file contain 37763 sequences?
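For illustration, the two numbers in question (rows versus total sequences) can be told apart with a minimal Python sketch. It assumes the usual two-column names-file layout: the representative name in column 1 and a comma-separated list of all sequences it stands for in column 2. The function name `count_names` and the toy data are hypothetical and not part of mothur.

```python
import io

def count_names(handle):
    """Return (unique_rows, total_sequences) for a two-column names file."""
    rows = 0
    total = 0
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        rows += 1
        # column 2 lists every sequence represented by this row
        total += len(parts[1].split(","))
    return rows, total

# Toy data (hypothetical): 2 unique rows standing for 4 sequences in total
example = io.StringIO("seqA\tseqA,seqB,seqC\nseqD\tseqD\n")
counts = count_names(example)
print(counts)
```

If the line count of VS.cyano.names is 3200 but the second number comes out as 37763, the file would be complete and only the way of counting differs.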
Next I tested the files by using classify.seqs
mothur > classify.seqs(fasta=VS.cyano.fasta, group=VS.cyano.groups, name=VS.cyano.names, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)
The VS.cyano.rdp6.tax.summary-file only contained classifications for 3200 cyanobacteria sequences and not for 37763 sequences.
So I re-ran the classification with the original name-file and group-file (which contain the complete set of 95342 sequences):
mothur > classify.seqs(fasta=VS.cyano.fasta, group=VS.final.groups, name=VS.final.names, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)
The VS.cyano.rdp6.tax.summary-file contained classifications for 37763 sequences.
Then I removed a lineage from the cyanobacteria-set:
mothur > remove.lineage(fasta=VS.cyano.fasta, taxonomy=VS.cyano_classif2.rdp6.taxonomy, taxon=Bacteria;Cyanobacteria;Cyanobacteria;Chloroplast;, group=VS.cyano.groups)
the newly generated files were renamed to “VS.cyano_noChloro.*”.
VS.cyano_noChloro.fasta: 1498 unique sequences
VS.cyano_noChloro.groups: 1498 sequences
If I run the command with the name-option, the name-file also contains only 1498 sequences.
I was only able to get 15110 classified sequences when I used the original name-file and group-file (which contain the complete set of 95342 sequences, as above).
I didn’t use the name-file that was generated by get.lineage for further analysis because it didn’t have all the names in it. So I tried to analyze the newly generated fasta file with the name-file for the complete sequence set:
mothur > dist.seqs(fasta=VS.cyano_noChloro.fasta, output=lt)
It took 8 to calculate the distances for 1498 sequences.
mothur > read.dist(cutoff=0.20, name=VS.final.names, phylip=VS.cyano_noChloro.phylip.dist)
mothur > cluster(method=average)
unique 12093 13766 1838 773 431 251 197 …
Why does it cluster all sequences and not just my selected ones (the only ones in the distance matrix)?
The generated list-file contains the complete data set (95342 sequences), not just the subset. I checked the “VS.cyano_noChloro.fasta”-file and it contains only 1498 sequences.
It seems that this is not possible without a name-file that matches the subset.
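To make concrete what a matching name-file would look like, here is a minimal Python sketch that keeps only the names-file rows whose representative sequence appears in a FASTA file. It assumes the first token of each FASTA header is the representative name used in column 1 of the names file. The helper names `fasta_ids` and `subset_names` and the toy data are hypothetical; this is not a mothur command.

```python
import io

def fasta_ids(handle):
    """Collect the first token of every '>' header line."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids

def subset_names(names_handle, keep_ids):
    """Keep only names-file rows whose representative is in keep_ids."""
    kept = []
    for line in names_handle:
        parts = line.split()
        if len(parts) >= 2 and parts[0] in keep_ids:
            kept.append(f"{parts[0]}\t{parts[1]}")
    return kept

# Toy data (hypothetical): the FASTA subset holds 2 of the 3 unique sequences
fasta = io.StringIO(">seqA\nACGT\n>seqD\nTTGA\n")
names = io.StringIO("seqA\tseqA,seqB\nseqD\tseqD\nseqX\tseqX,seqY,seqZ\n")
rows = subset_names(names, fasta_ids(fasta))
print("\n".join(rows))
```

The kept rows still carry their full duplicate lists in column 2, so the subset would have 1498 rows but represent all 15110 sequences.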
I tried to continue, but it wasn’t possible.
mothur > read.otu(group=VS.cyano_noChloro.groups, label=unique-0.03-0.05-0.10-0.20, list=VS.cyano_noChloro.phylip.an.list)
Your group file contains 1498 sequences and list file contains 95342 sequences. Please correct.
For a list of names that are in your list file and not in your group file, please refer to VS.cyano_noChloro.phylip2.an.missing.group.
…
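The mismatch reported above is between the names in the group-file and the names in the list-file. As a sketch of how the two could be brought into agreement, the following Python snippet expands column 2 of a (subset) names file and keeps only the group-file rows for those sequences. It assumes a two-column group file (sequence name, group). The helpers `expand_names` and `subset_groups` and the toy data are hypothetical.

```python
import io

def expand_names(names_handle):
    """Return every sequence name listed in column 2 of a names file."""
    members = set()
    for line in names_handle:
        parts = line.split()
        if len(parts) >= 2:
            members.update(parts[1].split(","))
    return members

def subset_groups(groups_handle, keep):
    """Keep (sequence, group) rows whose sequence name is in keep."""
    kept = []
    for line in groups_handle:
        parts = line.split()
        if len(parts) >= 2 and parts[0] in keep:
            kept.append((parts[0], parts[1]))
    return kept

# Toy data (hypothetical): seqX is not part of the subset and is dropped
names_subset = io.StringIO("seqA\tseqA,seqB\nseqD\tseqD\n")
groups_all = io.StringIO("seqA\tsite1\nseqB\tsite2\nseqD\tsite1\nseqX\tsite3\n")
groups_rows = subset_groups(groups_all, expand_names(names_subset))
print(len(groups_rows))
```

With files built this way, the group-file and the list-file would refer to the same set of sequences, duplicates included.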
Then I tried to run it without the name-option:
mothur > read.dist(cutoff=0.20, phylip=VS.cyano_noChloro.phylip.dist)
mothur > cluster(method=average)
unique 1 1498
0.01 11 1092 152…
mothur > read.otu(group=VS.cyano_noChloro.groups, label=unique-0.03-0.05-0.10-0.20, list=VS.cyano_noChloro.phylip.an.list)
The VS.cyano_noChloro.phylip.an.shared-file only contains 1498 sequences and not 15110.
Any ideas on what’s going on?