cluster: agc and dgc

Hello community, it is me again…
I want to understand the implementation of vsearch in Mothur.

This time around, I have a question about the differences between the agc (abundance) and dgc (distance) methods for clustering with vsearch.

I have searched the internet and read a couple of papers, but I still cannot make sense of it, or of when and why to use each one.

Can anyone help?

Thanks in advance

What did you sequence? For V4, don’t use vsearch for clustering. For ITS2, I used abundance, but distance is also acceptable. Just make sure you mention which one you used in your methods.

Thanks kmitchell

I am just playing with the program to compare analysis workflows. I want to understand it all and ultimately make my own decision about whether to use a given workflow, depending on the data set I run into.

Yes, I am using V4, and vsearch is lightspeed compared to cluster.split. I did read that the OTU labels are not very stable between clustering iterations when using vsearch, but I still wanted to see whether the biological conclusions drawn from vsearch, cluster.split and even phylotype are the same, and to what extent they differ. Ultimately, we are planning a super huge analysis with something like 10 MiSeq runs in it, so I will need some computing power to run that, and vsearch might help lower my computing requirements.

So basically my question was more about understanding the actual command than about whether I should use it or not.

Of course a greedy clustering method is faster. The question to consider when deciding how to handle your data shouldn’t be “what is the fastest way to process it?” but “what best answers the questions I have?” Blasting to get taxonomic IDs and then collapsing those IDs would be fastest, but that doesn’t mean anyone should be doing it (unless their data has significant errors and they’re trying to salvage it).
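On the original agc/dgc question, here is a toy sketch of how the two greedy strategies differ, as I understand them. Reads are processed from most to least abundant; a read either joins an existing centroid within the cutoff or becomes a new centroid. With agc it joins the most abundant centroid within the cutoff, with dgc it joins the closest one. This is not vsearch's or mothur's code: the function names, the simple mismatch distance, the example reads and the 0.25 cutoff are all made up for illustration.

```python
def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences (toy metric)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, cutoff, method):
    """Toy greedy clustering.

    seqs:   dict of name -> (sequence, abundance)
    cutoff: maximum distance allowed between a read and its centroid
    method: "agc" (most abundant centroid wins) or "dgc" (closest centroid wins)
    Returns a dict of read name -> centroid name.
    """
    if method not in ("agc", "dgc"):
        raise ValueError("method must be 'agc' or 'dgc'")

    # Process reads from most to least abundant.
    order = sorted(seqs, key=lambda name: seqs[name][1], reverse=True)
    centroids = []   # centroid names, kept in decreasing abundance order
    assignment = {}

    for name in order:
        seq = seqs[name][0]
        # All existing centroids within the cutoff, still in abundance order.
        hits = [(distance(seq, seqs[c][0]), c) for c in centroids]
        hits = [(d, c) for d, c in hits if d <= cutoff]

        if not hits:
            # Nothing close enough: this read seeds a new cluster.
            centroids.append(name)
            assignment[name] = name
        elif method == "agc":
            # Abundance-based: first hit is the most abundant centroid.
            assignment[name] = hits[0][1]
        else:
            # Distance-based: pick the centroid with the smallest distance.
            assignment[name] = min(hits, key=lambda h: h[0])[1]

    return assignment


# Hypothetical reads: Q is within the cutoff of both A and B, but closer to B.
reads = {
    "A": ("AAAAAAAAAA", 100),
    "B": ("AAAAAAACCC", 50),
    "Q": ("AAAAAAAACC", 5),
}

print(greedy_cluster(reads, 0.25, "agc"))
print(greedy_cluster(reads, 0.25, "dgc"))
```

In this made-up example, read Q lands with A under agc and with B under dgc, which is the whole difference in a nutshell. Either way each read is placed once and never revisited, which is why both are fast.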

I’m with kmitchell on this. I often hear people say they want to “try things out and see what it looks like”. I then wonder, what would it have to look like to affect their choice? We ran the experiments with different methods, and although USEARCH/VSEARCH were often close to average neighbor, average neighbor was reliably the best…


Pat

Thank you for the input.

Before posting my questions I had read the articles, and yes, I am convinced that it is better to follow the SOP as is.

So for huge data sets (close to 1000 samples), I should basically just get more computational power, which I will try to find.

But still, you cannot complain about a guy who is just trying to understand all the different workflows that are out there, instead of blindly following the SOP. Since I am stubborn, I will run different things in parallel. I need to play with the data to understand it better, learn it better, and be able to argue for it when the time comes. I will continue to post my questions here, if I may, and ask about things I do not understand if the case arises. But for publication, I will aim to follow what you are suggesting. After all, you have much more experience playing with these data sets than I ever will. So I will follow your wisdom.

Again, thanks a lot for the time you spend maintaining the forum. I read it all the time, even when I do not have any trouble with my data analysis.

“I then wonder, what would it have to look like to affect their choice”

Hmm, a puzzling question. From my point of view, I guess you must use the method that fits your resources and gives you the best output possible, meaning one that lets you be relatively confident in your main biological conclusions. But you need to be aware of the alternatives and include some of that in your discussion so that you do not overinterpret your results.

Apart from that, there are already so many differences between experiments (DNA extractions, matrix, 16S region being analyzed, workflow) that, in terms of downstream analysis, the more stable the better. That’s why we chose Mothur in our lab, to make things more uniform.

For now, my feeling is to use vsearch for a super-fast answer to get a feel for things and then just recluster everything overnight. The global biological conclusions should not change that much between analyses, but when it comes to diving deeper into the data and comparing to the literature, the more stable and reliable the better. And the more you sequence, the better you become at the wet lab side, and the better the quality of the data that comes out of it. We are experiencing that right now. “Suck until you don’t”.

But again, you must make the best of what is available to you in terms of computing, data quality, $$, time, pressure to publish and such, and just do better as you learn.

Just an opinion.