Cluster command output

Hello there.

I am working with the Schloss SOP with my own sequences and I have a problem:

After running the ‘dist.seqs’ and ‘cluster’ commands to find the OTUs present in my fasta file, I get three files: .sabund, .rabund and .list.

How can I get a .fasta file for my clusters at 0.03, for instance? I want not only the sequence names but also the nucleotide sequences of the representatives of each OTU.

Is it possible?

Thanks a lot

Try get.oturep.
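
For example, a minimal sketch of the call. All file names here are hypothetical placeholders, so substitute the distance, list, fasta and names files from your own run. If you calculated a phylip-formatted distance matrix, pass it with phylip= instead of column=.

```
# hypothetical file names; use the ones produced by your own dist.seqs and cluster runs
get.oturep(column=mydata.dist, list=mydata.an.list, fasta=mydata.fasta, name=mydata.names, label=0.03)
```

This should write a fasta file and a names file containing one representative sequence per OTU at the 0.03 level.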

Something I’ve been wondering about for a while: with the OTU-based approach in the SOP you create OTUs at various levels (say, 97% similarity) and base the downstream analysis on those, but the phylogenetic approach seems to work only at the unique-sequence level. If you wanted to go down the phylogenetic route, you could do the following:


dist.seqs() <- using the fasta output of get.oturep
etc… And just replace your original names file with the output files from get.oturep().
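
As a rough sketch, the first steps of that route might look like the following in mothur. The file name otus.rep.fasta is a hypothetical placeholder for the representative fasta written by get.oturep, and using clearcut to build the tree is an assumption borrowed from the SOP rather than a requirement.

```
# hypothetical file name: the representative fasta written by get.oturep
dist.seqs(fasta=otus.rep.fasta, output=lt)
# build a tree from the phylip-formatted matrix; the exact output file name may differ by version
clearcut(phylip=otus.rep.phylip.dist)
```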

The thing I want to check is whether there’s a problem with running dist.seqs a second time. I assume that no cutoff needs to be specified, because the sequences have effectively been filtered to that level of dissimilarity already. Is this correct?

Ugh. Yeah you could do this, but why? I think part of the intended beauty of the phylogenetic approach is that you don’t have to apply a cutoff.

I don’t know how valid these reasons are (hence the question), but here is why I’ve been thinking about it:

Firstly, just as a data reduction technique. The SOP is to dereplicate, removing identical sequences to reduce the processing requirements. This is really just reducing your data into unique OTUs, so using a lower similarity threshold (say, 97% similarity instead of 100%) is just an extension of this. That said, I realise that the output of get.oturep is not a consensus sequence.

Also, regardless of whether you want to do an OTU- or phylogeny-based analysis, the traditional ecological diversity measures are those used in the OTU approach, and I think it’s important to keep these in the workflow. That said, if you end up calculating Chao1/Simpson/Shannon/whatever estimators at a level other than unique, you’re going to dissociate the results of these calculators from the data you’re running through a unifrac. Case by case, it might be interesting to see whether the most abundant OTUs are all phylogenetically similar (e.g. different strains of E. coli) or quite different, and the only ways I can see to do this are to look at their taxonomies (if they’re very different, and only for broad comparison), or to use the method I described above.

Thank you very much. I’ve run get.oturep and it worked perfectly.