Came across this paper earlier today for a new amplicon sequence clustering software:
Seems like an interesting, slightly different approach to clustering than previous software has taken. My thoughts based on a quick test with some data from a project I’m working on:
- It works.
- It’s extremely fast: 10-15 mins to cluster ~500,000 uniques. Mothur actually fails to cluster this dataset at all (it only gets to unique or 0.01, per my recollection), so that is a plus as well.
- It’s trivial to integrate with mothur; the output is similar enough to a list file to be converted with 15 seconds of effort.
- It produces a lot of OTUs with the default -d 1 parameter as compared to a typical 97% cutoff. This can be adjusted, though.
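For anyone wanting to try the conversion mentioned in the list above, here is a minimal sketch. It assumes swarm writes one OTU per line with amplicon names separated by spaces, and that a mothur list line is a label, the number of OTUs, then one tab-separated column per OTU with comma-separated names. The function name and the label `swarm_1` are made up for illustration.

```python
def swarm_to_list_line(swarm_lines, label="swarm_1"):
    """Turn swarm-style OTU lines into one mothur-style list line.

    Assumes one OTU per input line, names separated by whitespace.
    The label is a hypothetical placeholder, not something swarm emits.
    """
    otus = []
    for line in swarm_lines:
        names = line.split()          # amplicon names in this OTU
        if names:                     # skip blank lines
            otus.append(",".join(names))
    # label, number of OTUs, then one tab-separated column per OTU
    return "\t".join([label, str(len(otus))] + otus)


example = ["SampleA_1C2 SampleA_1C3", "SampleB_0A1"]
print(swarm_to_list_line(example))
```

Depending on the mothur version, the list file may also need a header row, so check the result against a list file mothur produced itself.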
Curious what others think of their methodology, or any other comments on the paper/software.
My recollection is that it is essentially a nearest neighbor algorithm, which can be problematic as it will likely chain together OTUs…
Makes sense. The software performed pretty well on a small mock community (59 organisms), but I can see how chains might be formed in a natural community that might have 10,000+ species. It wasn’t compared to mothur though, which I expect wouldn’t have much trouble on such a simple sample either. It would be interesting if someone were able to create a mock community with a large number of species (500? 1000?) and test various clustering algorithms. I’m not aware of anyone who’s done that.
The authors acknowledged this problem in the paper and here: https://github.com/torognes/swarm#refine_OTUs. Not sure how much their method can mitigate the problem though, as I wasn’t able to test their script.
Thanks for mentioning Swarm!
Swarm is a two-step clustering method. Patrick Schloss is right, the first step is a nearest neighbor algorithm (i.e. single linkage clustering). For short or slowly evolving amplicons there is indeed a risk of chaining together OTUs. That’s why swarm uses a small d value by default (d being the maximum number of differences allowed between two amplicons) to minimize that risk. On the natural (and very large) communities I’ve been working with, the chaining problem affected only a limited number of OTUs. Nevertheless, we developed a second-step algorithm that finds and breaks all chains. It works very well, and allows swarm to retain high precision over a wide range of d values, but we are still experimenting to find the fastest way to do it. That’s why the second step is currently performed by an external Python script. Once the algorithm is stabilized, we will rewrite it in C++ and embed it into swarm.
Adam, you mentioned that you were not able to test the Python script. Could you please explain why? Was it incompatible with your Python setup? I’d be very interested to know.
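To make the chaining risk concrete, here is a toy sketch of single linkage clustering with a difference threshold d. It is not swarm's implementation: it compares all pairs, assumes equal-length sequences, and counts substitutions only, and the function names are made up for illustration.

```python
from itertools import combinations


def differences(a, b):
    """Count mismatches between two equal-length sequences (substitutions only)."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs, d=1):
    """Group sequences so that any two within d differences share a cluster."""
    parent = list(range(len(seqs)))   # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Link every pair that is at most d differences apart.
    for i, j in combinations(range(len(seqs)), 2):
        if differences(seqs[i], seqs[j]) <= d:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values())


seqs = ["AAAA", "AAAT", "AATT", "ATTT", "GGGG"]
print(single_linkage(seqs, d=1))
print(differences("AAAA", "ATTT"))
```

With d = 1, the first four sequences end up in one cluster because each is one difference from its neighbor, even though AAAA and ATTT differ at three positions. That is the chain the second step is meant to find and break.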
My sequences are named Samplename_sequencenumber, and I think the script is expecting Sequencename_sequencecount, so it’s failing when trying to interpret a sequence name like SampleA_1C2 (my sequence numbers are in hex): it wants 1C2 to be an int, when it’s not. The more important problem is that there aren’t any abundance counts in the name at all.
It’s not really a bug in your script; I just need to write something to rename my sequences using the actual counts from mothur’s count file if I want to test it. It would be great if swarm or its companion scripts could just read mothur’s count file (http://www.mothur.org/wiki/Count_File), since renaming the sequences will break things later on in mothur unless they get renamed back to their original names.
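A rough sketch of that renaming step, assuming the count file is tab-separated with a header row, the sequence name in the first column and the total abundance in the second. The function name and the name_total naming scheme are assumptions for illustration; check what the companion script actually expects before relying on it.

```python
import csv


def abundance_names(count_table_lines):
    """Map each original name to 'name_total' using a mothur-style count table.

    Assumes tab-separated rows: name, total, then per-group counts,
    with one header row at the top.
    """
    reader = csv.reader(count_table_lines, delimiter="\t")
    next(reader)                      # skip the header row
    renamed = {}
    for row in reader:
        if len(row) < 2:              # ignore blank or malformed rows
            continue
        name, total = row[0], int(row[1])
        renamed[name] = f"{name}_{total}"
    return renamed


table = [
    "Representative_Sequence\ttotal\tSampleA\tSampleB",
    "SampleA_1C2\t5\t3\t2",
    "SampleB_0A1\t1\t0\t1",
]
renamed = abundance_names(table)
restore = {new: old for old, new in renamed.items()}  # to rename back later
print(renamed)
```

Keeping the reverse mapping around is what lets the original names be restored before going back into mothur.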
As of now, swarm outputs a mothur-format list file: https://github.com/torognes/swarm#version-126
I’m not quite sure how to make a working shared file out of it. Because the list file is created from dereplicated sequences, just combining it with a group file shouldn’t produce correct abundance information; it seems you would need to build your own count file first.
Has anybody made it through swarm -> list file -> shared file? I’d be interested in trying out swarm-made OTUs in mothur downstream.
If anybody has some clues on how to get a shared file from a swarm-created list file, I’d really appreciate them sharing.
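One way to think about the abundance problem: a shared file needs per-sample counts for each OTU, and with dereplicated sequences those can only come from a count table. The sketch below works on in-memory structures (file parsing is left out) and is not a tested mothur workflow; the function name and data layout are assumptions for illustration.

```python
from collections import defaultdict


def otu_group_counts(otus, counts):
    """Sum per-group abundances over the unique sequences in each OTU.

    otus:   list of OTUs, each a list of unique sequence names.
    counts: dict mapping sequence name -> {group: abundance},
            e.g. as read from a count table.
    """
    table = []
    for members in otus:
        totals = defaultdict(int)
        for name in members:
            for group, n in counts[name].items():
                totals[group] += n
        table.append(dict(totals))
    return table


otus = [["s1", "s2"], ["s3"]]
counts = {
    "s1": {"A": 3, "B": 2},
    "s2": {"A": 0, "B": 4},
    "s3": {"A": 1, "B": 0},
}
print(otu_group_counts(otus, counts))
```

Each row of the result is one OTU and each key is one sample, which is the information a shared file carries.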