[Solved]: Reconstructing a data table from cross-tabulation frequencies

Problem Detail: Say there is a data table $D$ that we cannot see, with $M$ columns. We are given exact cross-tabulation frequencies for all ${M choose 2}$ pairs of columns, that is how often each combination of two values occurs. From the cross-tabulations, we can derive a set of possible rows $R$ of $D$ and maximum frequencies for each possible row. How can we reconstruct the original table $D$? If there is not enough information to do so, how can we construct a possible table $D’$ that has the same cross-tab frequencies? In this case, is it possible to count the number of possible tables? (Edit: As Vor noted, define a table as an unordered collection of rows. A permutation of the rows of a table yields the same table.) For example, if $D$ has rows:

X  A  j Y  A  k X  B  j X  B  j 

We have three sets of cross-tab frequencies:

X  A  1 X  B  2 Y  A  1 Y  B  0  X  j  3 X  k  0 Y  j  0 Y  k  1  A  j  1 A  k  1 B  j  2 B  k  2 

I would like an algorithm which will take the cross-tab frequencies as input and output the original table or a possible original table.

Asked By : Sarkom

Answered By : Raphael

We have $M$ columns $C_1, dots, C_M$ in the original table, each with a finite set of distinct values $V_1, dots, V_M$. Given the cross-table frequencies $c_{i,j}(a,b)$ (frequency of $a$ in $V_i$ and $b$ in $V_j$ occurring together), proceed as follows:

  1. Determine $V_1, dots, V_M$ and denote $V_i = {v^{(i)}_j mid j in [1,n_i] }$ where $n_i = |V_i|$.
  2. Solve the following integer program: $qquaddisplaystyle min sum_{i_1 = 1}^{n_1} cdots sum_{i_M = 1}^{n_M} x_{(i_1,dots,i_M)} qquad begin{align} text{s.t. } sum_{i_1 = 1}^{n_1} cdots sum_{i_{l-1} = 1}^{n_{l-1}} sum_{i_{l+1} = 1}^{n_{l+1}} cdots sum_{i_{r-1} = 1}^{n_{r-1}} sum_{i_{r+1} 1}^{n_{r+1}} cdots sum_{i_M = 1}^{n_M} x_{(i_1, dots, i_M)} &= c_{(l,r)}(i_l,i_r) qquad text{for all } 1 leq l < r leq M, i_l in [1,n_l], i_r in [1,n_r], x_{(i_1, dots, i_M)} &in mathbb{N} qquad text{for all } i_1 in [1,n_1], dots, i_M in [1,n_M] end{align}$

The $x_{(i_1, dots, i_M)}$ now indicate which row should occur how often, and we get a solution with the minimum number of rows possible. If we can influence the search, we can abort as soon as we find any feasible solution. This is not likely to be efficient, but there are honed solvers for integer programs around so it’s simple to implement. It may be possible to use an LP-solver and round to a feasible solution.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/4784  Ask a Question  Download Related Notes/Documents