KMeans Improving

KMeans++ Algorithm

Smart initialization:
1. Choose first cluster center uniformly at random from data points
2. For each obs x, compute distance d(x) to nearest cluster center
3. Choose new cluster center from amongst data points, with probability of x being chosen proportional to d(x)2
4. Repeat Steps 2 and 3 until k centers have been chosen
Run the standard KMeans with centers got above
1. Assign observations to closest cluster center
2. Revise cluster centers as mean of assigned observations
3. Repeat 1.+2. until convergence

在传统的K-Means算法中，我们在每轮迭代时，要计算所有的样本点到所有的质心的距离，这样会比较的耗时。那么，对于距离的计算有没有能够简化的地方呢？elkan K-Means算法就是从这块入手加以改进。它的目标是减少不必要的距离的计算。那么哪些距离不需要计算呢？

elkan K-Means利用了两边之和大于等于第三边,以及两边之差小于第三边的三角形性质，来减少距离的计算。

对于一个样本点\(x\)和两个质心\(\mu_1\),\(\mu_2\).如果我们预先计算出了这两个质心之间的距离\(D(\mu_1,\mu_2)\),则如果计算发现 \(2D(x,\mu_1) \leq D(\mu_1,\mu_2)\), 我们可以得知 \(2D(x,\mu_1) \leq D(x,\mu_2)\),此时我们不需要再计算\(D(x,\mu_2)\),也就是说省了一步距离计算。
对于一个样本点\(x\)和两个质心\(\mu_1\),\(\mu_2\),我们可以得到 \(D(x,\mu_2) > max\lbrace 0, D(x,\mu_1) - D(\mu_1,\mu_2)\rbrace\)

Classify: Assign observations to closest cluster center \[z_i = \mathop{\arg\min}_{j}\Vert \mu_{j} - x_{j} \Vert^2_2\]

For each data point, given \((\mu_{j}, x_{i} )\), \(emit(z_i, x_i)\)

\[\mu_j = \frac{1}{n_j}\sum\limits_{i:z_i = k} x_i\]

Reduce: Average over all points in cluster \(j(z_j = k)\)