Order matters In cluster() for OTU??

yingeddi2008 · December 18, 2013, 11:55pm

Hi All,
I recently noticed something interesting about the cluster() command. My colleague and I are working on the same project, starting from the same sff file and following the same procedure to process the data. The only thing we do differently is the names we assign to the samples: he used longer names based on the barcode names, while I picked arbitrary unique names. However, we ended up with different numbers of OTUs. I have 4518 OTUs in total while he has 4519, although the total count summed over all OTUs is the same, 323791. More than half of our taxonomy assignments differ as well.

I went back and checked every step of both procedures. His input file and mine are the same size and contain the same number of pairwise distances.

Below is mine:

[hl0333@talon2 icldenoise]$ ll all.otu.unique.dist
-rw-r--r-- 1 hl0333 pi_qd0005 130740533 Dec 13 15:39 all.otu.unique.dist
[hl0333@talon2 icldenoise]$ wc -l all.otu.unique.dist
3498894 all.otu.unique.dist

His file has the same size and line count:

[hl0333@talon2 2013_12_13]$ ll all.otu.unique.dist
-rw-r--r-- 1 kvr0011 pi_qd0005 130740533 Dec 14 21:52 all.otu.unique.dist
[hl0333@talon2 2013_12_13]$ wc -l all.otu.unique.dist
3498894 all.otu.unique.dist

But when I looked more closely, I found that the order of the entries in the distance matrix may be different.

[hl0333@talon2 icldenoise]$ tail all.otu.unique.dist
H6HR9S002C1YP0 H6HR9S002EVG4S 0.1147
H6HR9S002C1YP0 H6HR9S002DOHC1 0.0719
H6HR9S002C1YP0 H6HR9S002D07HO 0.1457
H6HR9S002C1YP0 H6HR9S002C35Y8 0.1104
H6HR9S002C1YP0 H6HR9S001BWGTO 0.06103
H6HR9S002C1YP0 H6HR9S001B75SV 0.1289
H6HR9S002C1YP0 H6HR9S002D8AIM 0.006536
H6HR9S002C1YP0 H6HR9S002D0MLE 0.1027
H6HR9S002C1YP0 H6HR9S002ENJGW 0.1496
H6HR9S002C1YP0 H6HR9S002DZKEU 0.1435
[hl0333@talon2 icldenoise]$ head all.otu.unique.dist
H6HR9S001ASM9H H6HR9S001B8AKN 0.1544
H6HR9S001AJPPR H6HR9S001AUV3U 0.06787
H6HR9S001ANM0C H6HR9S001B8AKN 0.1191
H6HR9S001ANM0C H6HR9S001AQCYC 0.1025
H6HR9S001A4UIZ H6HR9S001B8AKN 0.07191
H6HR9S001A4UIZ H6HR9S001ASM9H 0.1261
H6HR9S001A4UIZ H6HR9S001AQCYC 0.1526
H6HR9S001A4UIZ H6HR9S001ANM0C 0.1109
H6HR9S001BENDH H6HR9S001B8AKN 0.1542
H6HR9S001BENDH H6HR9S001AQCYC 0.05011

Below is his .dist file. The head is the same as mine, but the tail is different.

[hl0333@talon2 2013_12_13]$ tail all.otu.unique.dist
H6HR9S002C136L H6HR9S002DGG2Z 0.1422
H6HR9S002C136L H6HR9S002DE5R1 0.09324
H6HR9S002C136L H6HR9S002D6BRF 0.07143
H6HR9S002C136L H6HR9S002D4FH9 0.1282
H6HR9S002C136L H6HR9S002D2CEF 0.09007
H6HR9S002C136L H6HR9S002C97HF 0.08525
H6HR9S002C136L H6HR9S002C7CKB 0.09677
H6HR9S002C136L H6HR9S002C5WJ0 0.09908
H6HR9S002C136L H6HR9S002C31Y3 0.08159
H6HR9S002C136L H6HR9S002C14XS 0.05714
[hl0333@talon2 2013_12_13]$ head all.otu.unique.dist
H6HR9S001ASM9H H6HR9S001B8AKN 0.1544
H6HR9S001AJPPR H6HR9S001AUV3U 0.06787
H6HR9S001ANM0C H6HR9S001B8AKN 0.1191
H6HR9S001ANM0C H6HR9S001AQCYC 0.1025
H6HR9S001A4UIZ H6HR9S001B8AKN 0.07191
H6HR9S001A4UIZ H6HR9S001ASM9H 0.1261
H6HR9S001A4UIZ H6HR9S001AQCYC 0.1526
H6HR9S001A4UIZ H6HR9S001ANM0C 0.1109
H6HR9S001BENDH H6HR9S001B8AKN 0.1542
H6HR9S001BENDH H6HR9S001AQCYC 0.05011

In case the distance values themselves differed, I looked up the last pair from his file in mine. The distance is the same.

[hl0333@talon2 icldenoise]$ grep "H6HR9S002C136L H6HR9S002C14XS" all.otu.unique.dist
H6HR9S002C136L H6HR9S002C14XS 0.05714
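Checking one pair with grep only covers that pair. To compare every entry regardless of line order, something like the following could be used. This is only a rough sketch: it assumes each line has three whitespace-separated columns (name, name, distance), that the two names of a pair may appear in either order, and that both files fit in memory. The function names are made up for illustration and are not part of mothur.

```python
from collections import Counter


def pair_counts(lines):
    """Count (unordered name pair, distance string) entries from distance lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name_a, name_b, dist = fields
        # frozenset ignores which name comes first within a pair
        counts[(frozenset((name_a, name_b)), dist)] += 1
    return counts


def same_distances(path_a, path_b):
    """True if both files hold the same pairs and distances, in any order."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return pair_counts(file_a) == pair_counts(file_b)


# Tiny in-memory demo: same entries, different line order and name order.
mine = ["A B 0.1", "A C 0.2"]
his = ["A C 0.2", "B A 0.1"]
print(pair_counts(mine) == pair_counts(his))  # True
```

For the real files, same_distances() would be called with the two paths to all.otu.unique.dist. Distances are compared as text, so "0.1" and "0.10" would count as different.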

So my question is: does the order of the distance matrix affect the final result? How can that be right? Also, if the only difference between my colleague and me is the sample names, it seems really odd that naming the samples differently would give different results. Only because of the names?

Or, I hope, I am just being paranoid, or I have missed some randomness in the cluster() command?

Thank you,

Eddi

yingeddi2008 · December 20, 2013, 8:03pm

Nobody has replied. I feel sad… Or maybe it's just the holidays and everyone is busy shopping…? I am leading a pathetic life…

westcott · January 2, 2014, 5:59pm

Sorry for the delayed response. Pat explains this on the cluster command page, http://www.mothur.org/wiki/Cluster#Variability. “This is because there was a tie. A sequence could have joined more than one pre-existing OTU. mothur is programmed to randomly select the OTU that it should join. Because of this, it is possible to get differences between runs. This is just a byproduct of using an algorithm-based approach to clustering.”
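To make the tie situation concrete, here is a toy illustration in Python. It is not mothur's code or its clustering algorithm. The OTU labels, the distances, and the join_nearest_otu helper are all invented for this example. It only shows that when a sequence is equally close to two existing OTUs and the tie is broken at random, different runs can place it in different OTUs.

```python
import random


def join_nearest_otu(distances, rng):
    """Pick the closest OTU; if several are equally close, choose one at random."""
    best = min(distances.values())
    tied = [otu for otu, dist in distances.items() if dist == best]
    return rng.choice(tied)


# Hypothetical distances from one sequence to three existing OTUs.
# OTU_A and OTU_B are tied for the smallest distance.
distances = {"OTU_A": 0.03, "OTU_B": 0.03, "OTU_C": 0.08}

# Repeat the decision with different random seeds to mimic separate runs.
choices = {join_nearest_otu(distances, random.Random(seed)) for seed in range(50)}
print(sorted(choices))
```

Across the seeds, the sequence ends up in OTU_A in some runs and in OTU_B in others, and never in OTU_C. Once one early decision goes a different way, later joins can differ too, which is consistent with the small difference in OTU counts reported above.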
