Processing samples in subsets

clsmith · July 31, 2015, 9:09pm

I have a large set (192) of MiSeq V34 samples, so the distance matrix from cluster.split would be far too large to run all the samples on one machine. I’m just wondering about the theoretical implications of splitting the samples into smaller sets of, say, 10 samples and processing them separately. I understand I wouldn’t be able to merge the shared files since the OTUs would be different, but will the actual OTU results for each sample be different if I analyse them separately instead of all together? My gut (no pun intended) feeling is yes, they would be different, since the clustering would be different, but I wanted to get a second opinion.

Thanks!

pschloss · August 3, 2015, 2:48pm

Hi there,

Yeah, the OTU assignments and numbering will be very different. I’d probably suggest just doing classify.seqs and phylotype. For future reference, what you’re seeing with sequencing the V3-V4 region isn’t new: http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Pat

clsmith · August 4, 2015, 1:41pm

Hey Pat,

Thanks for the info. I had already come across that article, which was super helpful! I’m just trying to understand why the OTU table for an individual sample is different depending on which samples are analysed together. If you could shed some light on that, it would be great.

Thanks!

Christopher

pschloss · August 5, 2015, 2:37pm

Hi Christopher,

The hierarchical clustering algorithms work by looking at the data you have and clustering the sequences based on the distances between them and on their abundances. So as you add more data, you’re really adding more information. What we’ve found is that the clustering gets better with more data. So, because this is a reference-independent approach, the clustering isn’t static as you add more data. It might be frustrating, but it really is a feature and not a bug.

Pat
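
A toy illustration of the point above, not mothur's implementation: the sketch below runs a simple average-linkage clustering on made-up 1-D "positions" that stand in for sequences, with the absolute difference standing in for the pairwise distance. The names `A`, `B`, `C`, the positions, and the cutoff of 3 are all hypothetical, and abundances are ignored. It only shows that adding one intermediate sequence can change which OTU the original sequences fall into.

```python
from itertools import combinations


def average_linkage(points, cutoff):
    """Cluster names in `points` (name -> integer position) by average linkage.

    Clusters are merged greedily, closest pair first, until the smallest
    average between-cluster distance exceeds `cutoff`.
    Returns a sorted list of sorted member lists.
    """
    clusters = [frozenset([name]) for name in sorted(points)]

    def dist(c1, c2):
        # Average of all pairwise distances between the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(abs(points[a] - points[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: A and B are 4 units apart, C sits between them.
subset = {"A": 0, "B": 4}
full = {"A": 0, "B": 4, "C": 2}

otus_subset = average_linkage(subset, cutoff=3)
otus_full = average_linkage(full, cutoff=3)
print(otus_subset)
print(otus_full)
```

Clustered on their own, A and B are farther apart than the cutoff and stay in separate OTUs. Once C is included, it links them and all three end up in one OTU, so the assignment of A and B depends on what else is in the dataset.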

FM_Kerckhof · August 5, 2015, 5:53pm

Hi Pat,

I was about to open a new topic in this board requesting your and other users’ thoughts about the “OTU stability” paper in Microbiome by He et al. (http://www.biomedcentral.com/content/pdf/s40168-015-0081-x.pdf).
But then I got intrigued by your last comment:

What we’ve found is that the clustering gets better with more data. So, because this is a reference-independent approach, the clustering isn’t static as you add more data.

I myself was wondering quite a bit about that. Although I’m not very fond of a reference-dependent OTU-picking algorithm (even if it is “open reference”) from the viewpoint of the discrepancy with reference data in the databases and what is really out there, I do feel that reproducibility is a big plus.
One can imagine settings in which a dataset would be expanding (e.g. time series or additional sequencing of samples of groups deemed interesting by previous sequencing runs) and afterwards the conclusions from the initial analysis would become invalid.
However, you say that clustering gets better with more data. What exactly do you mean by that? What is a good OTU? Will you get OTU inflation with, say, 100 samples as compared to 50 samples?
I am also curious if cluster.split does not in fact stabilize OTUs by imposing some reference-based framework for clustering?

Kind regards,

FM

FM_Kerckhof · August 6, 2015, 8:18am

As an FYI, my post above does not necessarily contradict my opinion in a previous post (clustering against pre-clustered data?). But I do understand now why one would need datasets that increase in size over time, as I’ve been involved in some industrial projects in our lab where an external partner wants to follow up the community composition in a certain process and even makes decisions based on previous analysis results of the reduced dataset to steer communities. In this case stable OTU clustering is essential.

pschloss · August 10, 2015, 2:29pm

I’m in the process of writing a rebuttal article that shows their paper is… lacking. Sure, their algorithms are reproducible. But they’re also reproducibly bad. You may run average neighbor multiple times and get slightly different answers, but they’re all just as good and overall much better than the methods used in the paper. In other words, we get you right around the bull’s-eye; they always get you something far from the bull’s-eye.

Pat

FM_Kerckhof · August 12, 2015, 1:21pm

Hi Pat,

Looking forward to reading that rebuttal (keep me posted). I also had very interesting discussions on the topic with Sebastian Schmidt from the Von Mering group (HPC-clust), which essentially boil down to what you say in your reply, although their group is also interested in open-reference clustering (if performed correctly).

Kind regards,

Frederiek - Maarten
