KMeans++ Algorithm
Smart initialization:
- Choose first cluster center uniformly at random from data points
- For each obs x, compute distance d(x) to nearest cluster center
- Choose new cluster center from amongst data points, with probability of x being chosen proportional to d(x)2
- Repeat Steps 2 and 3 until k centers have been chosen
Run the standard KMeans with centers got above
- Assign observations to closest cluster center
- Revise cluster centers as mean of assigned observations
- Repeat 1.+2. until convergence
KMeans++ visualized
Elkan K-Means Algorithm (距离计算优化)
在传统的K-Means算法中,我们在每轮迭代时,要计算所有的样本点到所有的质心的距离,这样会比较的耗时。那么,对于距离的计算有没有能够简化的地方呢?elkan K-Means算法就是从这块入手加以改进。它的目标是减少不必要的距离的计算。那么哪些距离不需要计算呢?
elkan K-Means利用了两边之和大于等于第三边,以及两边之差小于第三边的三角形性质,来减少距离的计算。
- 对于一个样本点$x$和两个质心$\mu_1$,$\mu_2$.如果我们预先计算出了这两个质心之间的距离$D(\mu_1,\mu_2)$,则如果计算发现 $2D(x,\mu_1) \leq D(\mu_1,\mu_2)$, 我们可以得知 $2D(x,\mu_1) \leq D(x,\mu_2)$,此时我们不需要再计算$D(x,\mu_2)$,也就是说省了一步距离计算。
- 对于一个样本点$x$和两个质心$\mu_1$,$\mu_2$,我们可以得到 $D(x,\mu_2) > max\lbrace 0, D(x,\mu_1) - D(\mu_1,\mu_2)\rbrace$
MapReduce for scaling KMeans
What is MapReduce
Scaling up KMeans via MapReduce
- Classify: Assign observations to closest cluster center
For each data point, given $(\mu_{j}, x_{i} )$, $emit(z_i, x_i)$
- Recenter: Revise cluster centers as mean of assigned observations
Reduce: Average over all points in cluster $j(z_j = k)$