## KMeans++ Algorithm

1. Smart initialization:

1. Choose first cluster center uniformly at random from data points
2. For each obs x, compute distance d(x) to nearest cluster center
3. Choose new cluster center from amongst data points, with probability of x being chosen proportional to d(x)2
4. Repeat Steps 2 and 3 until k centers have been chosen
2. Run the standard KMeans with centers got above

1. Assign observations to closest cluster center
2. Revise cluster centers as mean of assigned observations
3. Repeat 1.+2. until convergence

## Elkan K-Means Algorithm (距离计算优化)

elkan K-Means利用了两边之和大于等于第三边,以及两边之差小于第三边的三角形性质，来减少距离的计算。

1. 对于一个样本点$x$和两个质心$\mu_1$,$\mu_2$.如果我们预先计算出了这两个质心之间的距离$D(\mu_1,\mu_2)$,则如果计算发现 $2D(x,\mu_1) \leq D(\mu_1,\mu_2)$, 我们可以得知 $2D(x,\mu_1) \leq D(x,\mu_2)$,此时我们不需要再计算$D(x,\mu_2)$,也就是说省了一步距离计算。
2. 对于一个样本点$x$和两个质心$\mu_1$,$\mu_2$,我们可以得到 $D(x,\mu_2) > max\lbrace 0, D(x,\mu_1) - D(\mu_1,\mu_2)\rbrace$

## MapReduce for scaling KMeans

#### Scaling up KMeans via MapReduce

1. Classify: Assign observations to closest cluster center

For each data point, given $(\mu_{j}, x_{i} )$, $emit(z_i, x_i)$

1. Recenter: Revise cluster centers as mean of assigned observations

Reduce: Average over all points in cluster $j(z_j = k)$