I have a bit of a problem - if I analyzed my 10 16S samples individually, I could easily (after dist.seqs and cluster) run the get.oturep command to get back a FASTA file containing one representative sequence from each OTU to afterwards analyze as I please. For example, sample A has 3000 sequences, but 1000 OTUs and I would get a resulting FASTA file with 1000 sequences in it.
However, when I combine all my samples into one FASTA file/alignment along with a group file to treat in mothur together (as per the recommendation: OTU differences when analyzing separate fastas or one file), I can no longer get back the famous “OTU rep” sequences by group. I need those to classify them (external database) in order to see what happens to the percentages/proportions of the different bacterial genera/families/orders between my different samples.
I thought that get.sharedseqs might do it by using some combination of options, but I can’t seem to have it re-create the FASTA file with, for example for sample A above, the 1000 sequences that represent the 1000 OTUs that belong to (not the same as unique to) sample A. Am I missing something in this command?
As a curiosity when testing this command, it outputted FASTA files containing the format:
AY457838 A 59
…as mentioned in the Wiki manual. This format with tabs in the title is not FASTA format (underscores are accepted but not tabs) and, for example, cannot be imported to open on the popular PC program BioEdit. Another “problem” is that the get.sharedseqs command now outputs “Seq name - group - OTU#” as the title whereas get.oturep outputs “Seq name - OTU# - # of seqs represented” which you can sort by size. This last parameter of the number of seqs represented is highly important and I would like to have that info in order to, for example, classify the “abundant OTUs” (those OTUs representing >1% of the total number of seqs), hence why I liked get.oturep.
Back to the question at hand: Just for fun, I tried using get.oturep(phylip=X.phylip.dist, fasta=X.fas, list=X.phylip.fn.list, label=0.03, sorted=size, group=sampleA) even though the group option is supposedly not supported (according to the Wiki) in the command. It did run…until my memory ran out (my 33,000 sequence .dist file is 3.7 Gb). I then added the large=true option and it ran, but I killed the process as the temp file was starting to take up all of the hard drive storage space. I might try again today with more disk available…Is this theoretically supposed to work?
If not, is there a way of getting to the same info using a combination of other commands with the list files, etc. that I haven’t pieced together yet?