[Solved]: match an array with a given set of arrays

Problem Detail: We have a set of 24 distinct arrays; each array has 36 elements, and each element can take one of 13 possible values. We are then given an array X (this array is certainly part of our set) and have to find which array of our set it corresponds to. Of course, we can check all 36 elements of X against the 36 elements of each of our 24 arrays and find the one it matches. I'm trying to reduce this number of 36 comparisons. My naive approach is something like this:

  • Take all possible subsets of 0..35, except the empty subset. Each of these subsets denotes which elements we should compare.
  • Apply each of these subsets to all 24 arrays of our given set, and check which subsets still distinguish all 24 arrays; these are the valid subsets.
  • From these valid subsets, the one with the fewest elements is our solution.
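The three steps above can be made concrete on a toy instance. This is a hypothetical Python sketch (the name `minimal_discriminating_subset` is mine), run on 4 arrays of length 6 rather than the 24 × 36 of the question, since the full $2^{36}-1$ search is infeasible:

```python
from itertools import combinations

def minimal_discriminating_subset(arrays):
    """Exhaustively search for the smallest set of indices whose values
    distinguish every pair of arrays. Infeasible for s = 36; shown only
    to make the naive approach concrete."""
    s = len(arrays[0])
    for size in range(1, s + 1):             # smallest subsets first
        for subset in combinations(range(s), size):
            keys = {tuple(a[i] for i in subset) for a in arrays}
            if len(keys) == len(arrays):     # all projections distinct
                return subset
    return None

# Toy instance: 4 distinct arrays of length 6.
arrays = [
    [0, 1, 2, 0, 0, 1],
    [0, 1, 3, 0, 0, 1],
    [0, 2, 2, 0, 0, 1],
    [0, 2, 3, 0, 0, 1],
]
print(minimal_discriminating_subset(arrays))  # → (1, 2)
```

Here indices 1 and 2 already separate all four arrays, so only 2 of the 6 elements need to be compared.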

Unfortunately, the number of possible subsets of 0..35 excluding the empty subset is $2^{36}-1$, so I was wondering whether there is another approach to the problem. Many different arrays X will be matched against the given set of 24 arrays, so any preprocessing of the 24 arrays is well worth the effort. User JarkkoL suggested precomputing the CRC32 of each array in the given set and then comparing it with the CRC32 of the array X. Still, I'm wondering if there's a smart way that avoids computing the CRC32 over the whole array. I'm looking for the smallest possible number of elements over which I should compute it.
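JarkkoL's CRC32 suggestion can be sketched with Python's standard `zlib.crc32`, assuming each element fits in a byte (true for 13 possible values). A dictionary maps precomputed CRCs to reference arrays; since X is guaranteed to be in the set, a lookup identifies it, provided one verifies at preprocessing time that the 24 CRCs are pairwise distinct:

```python
import zlib

def crc_of(array):
    # 13 possible values per element, so one byte per element suffices.
    return zlib.crc32(bytes(array))

reference = [
    [0, 1, 2, 0, 0, 1],
    [0, 1, 3, 0, 0, 1],
    [0, 2, 2, 0, 0, 1],
]
# Preprocessing, done once: CRC32 of every reference array.
by_crc = {crc_of(a): a for a in reference}
assert len(by_crc) == len(reference)  # no CRC collisions in this set

x = [0, 2, 2, 0, 0, 1]
print(by_crc[crc_of(x)] == reference[2])  # → True
```

Note that this still reads all 36 elements of X once to hash them; it saves the 24-way comparison, not the per-element reads the question asks about.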

Asked By : egwspiti

Answered By : babou

Let $s$ be the size of an array (36 in your example), $v$ the number of possible values (here 13) of each element, and $n$ the number of reference arrays (here 24), i.e., the size of the set $S=\{A_i \mid i\in[1..n]\}$ of arrays. I am assuming that the cost of preprocessing the arrays does not matter. Otherwise, you can always simplify the procedure below, as long as you do it well enough not to exceed the worst-case time complexity $O(n)$.

Ideally you want to build a balanced binary tree of tests for identification, so that you get your answer in $\lceil \log_2 n \rceil$ tests, i.e., 5 tests at most with the given figures. In other words, you are looking for 5 bits of information that allow you to discriminate between your $n=24$ arrays. For that you would need a criterion, testable in one comparison, that can cut approximately in two any given set of arrays, so that if $n_1$ and $n_2$ are the sizes of the two subsets, then $\lfloor \log_2 n_1 \rfloor = \lfloor \log_2 n_2 \rfloor$.

However, that is not always possible. A worst-case example is when array $A_i$ contains only $0$, except for element $A_i[i]$, which contains $1$. Then the only way to identify $X$ in the reference set is to scan the elements of $X$ to find which one has value $1$. This has complexity $O(n)$. So you know complexity can be as bad as that, when an index discriminates at best one array. Of course, in general, you have to analyse the set of arrays to determine which indices you have to look at.

However, if the set of arrays permits, you can do better, as indicated above. For each index $i$ of the arrays, you try to identify a simple test $T_i$ that will partition the set $S$ of arrays into two subsets $S_1$ and $S_2$ of nearly equal sizes $n_1$ and $n_2$, such that $|n_1-n_2|$ is minimal (this is a bit stronger than the condition above, but the idea is to keep some slack in each half for future partitions). You do that for each index $i\in[1..s]$, and you keep the index $i_0$ that gives the minimal difference $|n_1-n_2|$.
Of course, you can stop as soon as you have a difference that is less than $2$, or possibly as soon as $\lfloor \log_2 n_1 \rfloor = \lfloor \log_2 n_2 \rfloor$ (the choice is purely heuristic). You can even stop earlier if you are not too worried about getting the fastest test tree.

Now for index $i_0$ you have a test $T_{i_0}$ that partitions your set $S$ of arrays into two disjoint subsets $S_1$ and $S_2$ of respective sizes $n_1$ and $n_2$, on the basis of the values at index $i_0$. This will be the first test you apply to a new array $X$ to identify it in the set $S$; it will tell you whether $X$ is in $S_1$ or $S_2$. You then know that identifying $X$ will take at most $1+\max(n_1,n_2)$ steps. Then you pursue the construction of your balanced tree on each of the two subsets $S_1$ and $S_2$, and with some luck you can approach the optimal complexity $O(\log_2 n)$. No guarantee given, since a bad set of arrays can force the maximum complexity $O(n)$, with the forced reading of, on average, $n/2$ elements of $X$.

But there is still one very important detail missing: what kind of test is to be used? The answer is: any kind that is easy enough to count as one step. So I guess a primality test that partitions arrays into those with a prime integer at index $i$ and those with a non-prime one is not a good idea. Suggestions are: testing whether $A_j[i]$ is equal to a given value, or whether it is greater than some value, or whether $A_j[i] \bmod p = r$ for two values $p$ and $r$. Just use your imagination, if there is a need for it. This can all be mechanized and tried automatically, but at a computation cost. What you should do depends on how much speeding up recognition by thorough preprocessing matters to you.
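A minimal Python sketch of this construction, restricted to equality tests $A[i] = v$ (the simplest of the test families suggested above; function names are mine):

```python
def best_test(arrays):
    """Among all equality tests A[i] == v, pick the one whose yes/no
    split of the current set is most balanced (minimal |n1 - n2|)."""
    best = None
    for i in range(len(arrays[0])):
        for v in {a[i] for a in arrays}:
            n1 = sum(1 for a in arrays if a[i] == v)
            n2 = len(arrays) - n1
            if best is None or abs(n1 - n2) < best[0]:
                best = (abs(n1 - n2), i, v)
    return best[1], best[2]

def build_tree(arrays):
    """Recursively build the binary test tree over a set of distinct arrays."""
    if len(arrays) == 1:
        return arrays[0]                        # leaf: the identified array
    i, v = best_test(arrays)
    yes = [a for a in arrays if a[i] == v]
    no = [a for a in arrays if a[i] != v]
    return (i, v, build_tree(yes), build_tree(no))

def identify(tree, x):
    """Walk the tree, one element test of x per step, down to a leaf."""
    while isinstance(tree, tuple):
        i, v, yes, no = tree
        tree = yes if x[i] == v else no
    return tree

reference = [[0, 1, 2], [0, 1, 3], [1, 1, 2], [1, 2, 2]]
tree = build_tree(reference)
print(identify(tree, [1, 2, 2]))  # → [1, 2, 2]
```

Because the arrays are distinct, some test always splits off a nonempty subset, so the recursion terminates; with a balanced split at every node, identification costs about $\lceil \log_2 n \rceil$ element tests.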

A faster solution for large values of $n$, assuming $v$ remains small.

Actually, though the worst case $O(n)$ cannot be improved, one can do better than the first solution I propose above, assuming $v$ remains small. The idea is that one does not need to use a binary tree; with more branching, one can get to the answer even faster.

For that purpose, we need to build a function $\phi$ that maps the $v$ values used into the integer range $[1..v]$. This can be done by hashing or by other means. We can then assume without loss of generality that the values used are the integers in $[1..v]$.

One could find a minimal set $I$ of indices for the arrays such that for any pair of arrays $A_j$ and $A_k$ there is an index $i\in I$ with $A_j[i] \neq A_k[i]$. Then, for any array $X$ known to be equal to some array $A_h\in S$, we can identify $A_h$ by looking only at the values at the indices in $I$. One could then build a decision tree based on these indices considered in succession. However, finding such a minimal set of indices may be difficult (I did not look into its complexity), and there may be better solutions that use independent indices for the different nodes on a given level of the decision tree.

So we can probably get an even faster result by proceeding as follows. Choose a first index $i_0$ such that the $p$ different values at that index partition the set $S$ of arrays into $p$ subsets $S_x$ for $x\in [1..p]$. Depending on whether you are trying to reduce the average cost or the maximum cost, you may choose $i_0$ so as to maximize $p$, or to minimize the maximum size of the subsets $S_x$ (though this is heuristic). These $p$ subsets are the daughters of the root in the decision tree being built. Then you repeat this operation for each subset, finding independently for each subset $S_x$ a new index $i_{x,1}$ so as to further partition the subset. You repeat the operation until you reach the leaves, which correspond to singleton sets, i.e., to a specific array in $S$.
You get a widely branching decision tree, where each non-leaf node (actually corresponding to a subset of arrays) is labeled by an index, each branch is labeled by a value found at that index, and each leaf is labeled by an array in $S$. This tree must be implemented so that each branch can be accessed in one step by direct indexation. Then, given an array $X$, you read its value at the index $i_0$ labeling the root, and use $\phi(X[i_0])$ to access by indexation the next node in the tree. You repeat the step until you reach a leaf. When $v$ is large, the algorithm has to be adapted by using a different function $\phi$ at each node.
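A possible sketch of this branching construction in Python (function names are mine). Each node's branch dictionary plays the role of the mapping $\phi$, giving one-step access per branch; the heuristic used here is to maximize the number of subsets $p$ at each node:

```python
def build_branching_tree(arrays):
    """At each node, pick the index whose values split the current set of
    distinct arrays into the most subsets, then recurse on each subset."""
    if len(arrays) == 1:
        return arrays[0]                      # leaf: a specific array in S
    best_i, best_parts = None, None
    for i in range(len(arrays[0])):
        parts = {}
        for a in arrays:
            parts.setdefault(a[i], []).append(a)
        if len(parts) > 1 and (best_parts is None or len(parts) > len(best_parts)):
            best_i, best_parts = i, parts
    # Node: the chosen index, plus one branch per value found at that index.
    return (best_i, {v: build_branching_tree(sub) for v, sub in best_parts.items()})

def lookup(tree, x):
    """Identify x by one direct indexation per tree level."""
    while isinstance(tree, tuple):
        i, branches = tree
        tree = branches[x[i]]
    return tree

reference = [[0, 1, 2], [0, 1, 3], [1, 1, 2], [1, 2, 2]]
tree = build_branching_tree(reference)
print(lookup(tree, [0, 1, 3]))  # → [0, 1, 3]
```

The dictionary lookup stands in for indexing by $\phi(X[i])$; with small $v$ one would use a plain array of $v$ slots per node instead.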


Question Source : http://cs.stackexchange.com/questions/29539