I have a large set (192) of MiSeq V3-V4 samples, so the distance matrix from cluster.split would be far too large to run all the samples on one machine. I’m just wondering about the theoretical implications of splitting the samples into smaller sets of, say, 10 samples and processing them separately. I understand I wouldn’t be able to merge the shared files since the OTUs would be different, but will the actual OTU results for each sample be different if I analyse them separately instead of all together? My gut (no pun intended) feeling is yes, they would be different, since the clustering would be different, but I wanted to get a second opinion.
Yeah, the OTU assignments and numbering will be very different. I’d probably suggest just doing classify.seqs and phylotype. For future reference, what you’re seeing with sequencing the V3-V4 region isn’t new: http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/
Thanks for the info. I had already come across that article, which was super helpful! I’m just trying to understand why the OTU table for an individual sample is different depending on which samples are analysed together. If you could shed some light on that, that would be great.
The hierarchical clustering algorithms work by looking at the data you have and then clustering the sequences based on the distances between the sequences and the abundances of the sequences. So as you add more data, you’re really adding more information. What we’ve found is that the clustering gets better with more data. So, because this is a reference-independent approach, the clustering isn’t static as you add more data. It might be frustrating, but it really is a feature and not a bug.
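To make the point above concrete, here is a minimal toy sketch in Python. It is not mothur's implementation: it uses a simple average-linkage merge rule, ignores sequence abundances, and the sequence names, positions, and the cutoff of 3 (think "3% difference") are all made up for illustration. Each sequence is a point on a line and the distance between two sequences is just the gap between their positions.

```python
from itertools import combinations


def average_neighbor(points, cutoff):
    """Toy average-linkage clustering (hypothetical helper, not mothur code).

    points maps a sequence name to a position on a line; the distance
    between two sequences is the absolute difference of their positions.
    The two closest clusters are merged repeatedly until the smallest
    average between-cluster distance exceeds the cutoff.
    """
    clusters = [[name] for name in sorted(points)]

    def avg_dist(a, b):
        total = sum(abs(points[x] - points[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (i, j), d = min(
            (((i, j), avg_dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up data: seqA and seqB are 3.5 apart, just over the cutoff of 3.
small = {"seqA": 0.0, "seqB": 3.5}
# The same two sequences plus one intermediate sequence from another sample.
large = {"seqA": 0.0, "seqB": 3.5, "seqC": 1.5}

print(average_neighbor(small, cutoff=3.0))  # [['seqA'], ['seqB']]
print(average_neighbor(large, cutoff=3.0))  # [['seqA', 'seqB', 'seqC']]
```

On their own, seqA and seqB land in two OTUs. Once the intermediate seqC is in the dataset, seqA and seqC merge first, the average distance from seqB to that cluster drops to 2.75, and all three end up in one OTU. Nothing about seqA or seqB changed, only the data they were clustered with.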
I was about to open a new topic in this board requesting your and other users’ thoughts about the “OTU stability” paper in Microbiome by He et al. (http://www.biomedcentral.com/content/pdf/s40168-015-0081-x.pdf).
But then I got intrigued by your last comment:
What we’ve found is that the clustering gets better with more data. So, because this is a reference-independent approach, the clustering isn’t static as you add more data.
I myself was wondering quite a bit about that. Although I’m not very fond of a reference-dependent OTU-picking algorithm (even if it is “open reference”) because of the discrepancy between the reference data in the databases and what is really out there, I do feel that reproducibility is a big plus.
One can imagine settings in which a dataset would be expanding (e.g. time series or additional sequencing of samples from groups deemed interesting by previous sequencing runs) and afterwards the conclusions from the initial analysis would become invalid.
However, you say that clustering gets better with more data. What exactly do you mean by that? What is a good OTU? Will you get OTU inflation with, say, 100 samples as compared to 50 samples?
I am also curious whether cluster.split does in fact stabilize OTUs by imposing some reference-based framework for clustering.
As an FYI, my above post does not necessarily contradict my opinion in a previous post (clustering against pre-clustered data?). But I do understand now why one would need datasets that increase in size over time, as I’ve been involved in some industrial projects in our lab where an external partner wants to follow up the community composition in a certain process and even makes decisions based on previous analysis results from the reduced dataset to steer communities. In this case stable OTU clustering is essential.
I’m in the process of writing a rebuttal article that shows their paper is… lacking. Sure, their algorithms are reproducible. But they’re also reproducibly bad. You may run average neighbor multiple times and get slightly different answers, but they’re all just as good and overall much better than the methods used in the paper. In other words, we get you right around the bull’s-eye; they always get you something far from the bull’s-eye.
Looking forward to reading that rebuttal (keep me posted). I also had very interesting discussions on the topic with Sebastian Schmidt from the Von Mering group (HPC-clust), which essentially boil down to your reply, although their group is also interested in open reference clustering (if performed correctly).
Frederiek - Maarten