I’m trying to work with just the abundant OTUs for the moment. I filtered my shared by #of seqs which has reduced it down to <6000 OTUs which I’d now like to analyze with unifraq. I need to create a fasta with just the representative seqs for just those OTUs which I can run dist.seqs on, then build the tree. What would be the most straight forward way to do that? I’ve been trying to write a gawk oneliner that uses an accnos file as keys but don’t know enough gawk to get the line with the sequence, just the one with the OTU number.
We are adding a modification to mothur in the next release to help with situations like this. In the next release the list file will include OTU labels. That will allow you to run commands like:
list.otulabels(shared=yourFilteredSharedFile) - list the otu labels in your filtered shared file
get.otuLabels(list=yourListFile) - select otus from list file. Assumes you are using files that relate to eachother.
list.seqs(list=yourListFile) - list sequences from list file
get.dists(phylip=yourDistanceMatrix) - get distances related to abundant otus you selected
get.seqs(fasta=yourFastaFile, name=yourNamesFile) - get abundant sequences
get.oturep(phylip=yourDistanceMatrix, list=yourListFile, fasta=yourFastaFile, name=yourNameFile) - find reprensentative sequences
Without this modification, you will have to select the OTUs from the list file manually. Mothur does sort the list file for you by abundance, so if you selected the top 6000 most abundant OTUs, then that would be the first 6000 OTUs listed in the list file.
Thanks. I’m trying to make a database then gawking that. My otus aren’t sorted by abundance but maybe that’s because I used the cluster.split option-they’re ordered by phyla
Could you use the split.abund command with the list file to get the otus you want and then build the shared file? Then you could use the list file with get.seqs?
Hey Sarah, I’m not sure I’m following this. I have the filtered shared file (w/ 5578 otus), representative otu fasta for all ~300k otus, list file for all ~300k otus. so split.abund I’d use the representative fasta, full list, and cutoff=100 (which is what I used to filter the shared file)?
which seems like it worked except I get 5642 sequences instead of 5578 which is how many otus I get when I filter shared with mintotal=100. so close but not exactly the same, any idea why?
I took a closer look for you. In the split.abund command, we use <= to send an OTU to *.rare. In the filter.shared command, if an OTU is < the cutoff it is removed.
the way i did this was by making a fasta and names file of the just the representative sequences : get.oturep(phylip=, list=,fasta=)
then I opened up the fasta file- voila, ordered by abundance of OTU