Problem Detail: I have an array of real numbers, and I want to partition them into k sets. In each set, I calculate the sum of squared errors (the squared deviations from the set's mean), then add up the sums over all sets. I want to minimize this total. For example, given 1 3 5 9 and k=2, I would partition them into (1 3 5) and (9). The sum of squared errors of (1 3 5) is (1-3)^2+(3-3)^2+(5-3)^2=8, and the sum of squared errors of (9) is 0, so the total is 8. I think 8 is the minimum in this case. I want to use the traditional Prim's method to solve this problem, i.e. connect n-k edges, then stop. But the problem is that whenever I add one more number to a subset, the mean changes. So it seems that I cannot use Prim's method… Can anyone give me some insight on this? Thanks!
Asked By : user196736
Answered By : Nick
You are looking for an optimal 1-dimensional $k$-means algorithm. The $k$-means objective function for partitioning the data $x_1, \ldots, x_n$ into $k$ sets $S = \{S_1, \ldots, S_k\}$ is $$ \sum\limits_{i=1}^k \sum\limits_{x \in S_i} \lVert x - \mu_i \rVert^2 $$ where $\mu_i$ is the mean of $S_i$ [1]. You can apply a dynamic programming algorithm to this problem [2].
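As a quick sanity check, the objective above can be computed directly for any given partition. A minimal sketch (the helper names `sse` and `objective` are mine, not from either reference):

```python
def sse(cluster):
    """Sum of squared errors of a cluster about its own mean."""
    mu = sum(cluster) / len(cluster)
    return sum((x - mu) ** 2 for x in cluster)

def objective(S):
    """The k-means objective: total SSE over all clusters in partition S."""
    return sum(sse(Si) for Si in S)
```

For the questioner's example, `objective([[1, 3, 5], [9]])` evaluates to `8.0`, matching the hand computation.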
- Let the data $x_1, \ldots, x_n$ be sorted in non-decreasing order.
- Fill an $(n + 1) \times (k + 1)$ array, $D$, where $D[i,m]$ is the minimum sum of squared errors for clustering the first $i$ entries into $m$ clusters. It can be calculated as $$ D[i,m] := \min\limits_{1 \le j \le i} \left\{ D[j-1,m-1] + d(x_j, \ldots, x_i) \right\} $$ for $1 \le i \le n$ and $1 \le m \le k$, where $d(x_j, \ldots, x_i)$ is the sum of squared errors of $x_j, \ldots, x_i$ from their mean. The base cases are $D[0,m] = 0$ for all $m$, and $D[i,0] = \infty$ for $i \ge 1$ (no clustering of a nonempty prefix uses zero clusters). Thus, the entry $D[n,k]$ contains the optimal sum of squared errors for the original problem; all that is left to do is to extract the clustering itself. Note that $d(x_j, \ldots, x_i)$ can be computed iteratively to speed up the algorithm. Writing $\mu_{j,i}$ for the mean of $x_j, \ldots, x_i$, $$\begin{align*} d(x_j, \ldots, x_i) &= d(x_j, \ldots, x_{i-1}) + \frac{i-j}{i-j+1} (x_i - \mu_{j,i-1})^2 \\ \mu_{j,i} &= \frac{x_i + (i-j)\,\mu_{j,i-1}}{i-j+1} \end{align*}$$ Thus, $D$ can be computed in $O(n^2k)$ time, since there are $nk$ entries, and each takes $O(n)$ time to compute.
- Now, we fill an $n \times k$ array, $B$, whose entry $B[i,m]$ contains the index of the smallest (first) element of the last cluster in an optimal $m$-clustering of $x_1, \ldots, x_i$, which can be calculated as $$ B[i,m] := \mathop{\arg\min}\limits_{1 \le j \le i} \left\{ D[j-1,m-1] + d(x_j, \ldots, x_i) \right\} $$ for $1 \le i \le n$ and $1 \le m \le k$. This takes $O(n^2k)$ time (and can be recorded while filling $D$). We can then backtrack from $B[n,k]$ to recover the clustering. The clusters are defined recursively: with $j_k := B[n,k]$, $$ S_k = \{x_{j_k}, \ldots, x_n\}, $$ and if $S_{m+1}$ begins at $x_j$, then with $j_m := B[j-1,m]$, $$ S_m = \{x_{j_m}, \ldots, x_{j-1}\}. $$ The backtracking takes $O(k)$ time.
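Putting the steps above together, here is a minimal Python sketch of the dynamic program (the function and variable names are mine; the inner loop grows each candidate cluster leftwards, maintaining the cluster's mean and SSE incrementally in the spirit of the update formula above):

```python
import math

def kmeans_1d(xs, k):
    """Optimal 1-D k-means by dynamic programming.
    Returns (minimum total SSE, list of clusters)."""
    xs = sorted(xs)
    n = len(xs)
    INF = math.inf
    # D[i][m]: min SSE for the first i points using m clusters.
    # B[i][m]: start index (1-based) of the last cluster in that optimum.
    D = [[INF] * (k + 1) for _ in range(n + 1)]
    B = [[0] * (k + 1) for _ in range(n + 1)]
    for m in range(k + 1):
        D[0][m] = 0.0            # empty prefix costs nothing
    for m in range(1, k + 1):
        for i in range(1, n + 1):
            # Grow the candidate last cluster x_j..x_i leftwards,
            # updating its mean and SSE incrementally (Welford-style).
            mean, sse, cnt = 0.0, 0.0, 0
            for j in range(i, 0, -1):
                cnt += 1
                delta = xs[j - 1] - mean
                mean += delta / cnt
                sse += delta * (xs[j - 1] - mean)
                cand = D[j - 1][m - 1] + sse
                if cand < D[i][m]:
                    D[i][m] = cand
                    B[i][m] = j
    # Backtrack from B[n][k] to recover the clusters.
    clusters = []
    i, m = n, k
    while m > 0 and i > 0:
        j = B[i][m]
        clusters.append(xs[j - 1:i])
        i, m = j - 1, m - 1
    clusters.reverse()
    return D[n][k], clusters
```

On the questioner's example, `kmeans_1d([1, 3, 5, 9], 2)` returns `(8.0, [[1, 3, 5], [9]])`, agreeing with the hand-computed optimum.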
The algorithm runs in $O(n^2k)$ time and requires $O(nk)$ space.

[1] http://en.wikipedia.org/wiki/K-means_clustering
[2] http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf
Best Answer from Stack Exchange
Question Source : http://cs.stackexchange.com/questions/33696