[Solved]: Under what conditions is K-means clustering transformation-invariant?

Problem Detail: Given a set of data points $X = {x_1, x_2, ldots, x_m}$ where $x_i in mathbb{R}^d$ we run K-means on $X$ and obtain the clusters $c_1, c_2, ldots, c_k$. Now, if we create a new dataset $Y = {y_1, y_2, ldots, y_m}$ where $y_i = Ax_i + b$ and $y_i in mathbb{R}^d$ and run K-means on $Y$ to get clusters $g_1, g_2, ldots g_k$. Under what conditions of $A$ and $b$ are we guaranteed to get the same clusters? Let’s assume that K-means is using the euclidean distance and has same initial conditions on both algorithms, that is, if the initial centers for X are $c^0_1, ldots, c^0_k$ then the initial centers for Y are $g^0_1, ldots, g^0_k$ where $g^0_i = Ac^0_i + b$. So far I’ve thought that $A$ has to be full rank and $b$ can be any vector. However, I haven’t been able to prove it.

Asked By : Ana Echavarria

Answered By : Yuval Filmus

The answer depends on your K-means algorithm, but what follows should work for standard algorithms. You will get the same result if your transformation $T$ satisfies two conditions:

  1. It preserves distances: $d(z,w) = d(T(z),T(w))$, where $d$ is your metric, say $d(z,w) = |z-w|$.
  2. It preserves averages: if $sum_i p_i z_i$ is a convex combination that $T(sum_i p_i z_i) = sum_i p_i T(z_i)$.

You can check this by going over the algorithm, showing that it always makes the same choices.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/64843  Ask a Question  Download Related Notes/Documents