[Figure: an example grid of bitvectors (■ = 1, □ = 0), one bitvector per row, in which every pair of rows shares a black square in some common column; i.e., no non-overlapping pair exists.]
How efficiently can two non-overlapping bitvectors be found among $n$ bitvectors of length $k$, or be shown not to exist? (Two bitvectors are non-overlapping if there is no position where both have a $1$.) The naive algorithm, which compares every possible pair, takes $O(n^2 k)$ time. Is it possible to do better?
Asked By : Craig Gidney
Answered By : D.W.
Warmup: random bitvectors
As a warm-up, consider the case where each bitvector is chosen i.i.d. uniformly at random. It turns out that the problem can then be solved in $O(n^{1.6} \min(k, \lg n))$ time (more precisely, the exponent $1.6$ can be replaced with $\lg 3 \approx 1.585$). We'll consider the following two-set variant of the problem: given sets $S,T \subseteq \{0,1\}^k$ of bitvectors, determine whether there exists a non-overlapping pair $s \in S$, $t \in T$. The basic technique is divide-and-conquer. Here is an $O(n^{1.6} k)$-time algorithm using divide-and-conquer:
- Split $S$ and $T$ based upon the first bit position. In other words, form $S_0 = \{s \in S : s_0 = 0\}$, $S_1 = \{s \in S : s_0 = 1\}$, $T_0 = \{t \in T : t_0 = 0\}$, $T_1 = \{t \in T : t_0 = 1\}$.
- Now recursively look for a non-overlapping pair from $S_0,T_0$, from $S_0,T_1$, and from $S_1,T_0$. (Any pair from $S_1,T_1$ overlaps at position $0$, so that combination can be skipped.) If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
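The two steps above can be sketched in Python (a hypothetical rendering, not the answer's own pseudocode; each bitvector is encoded as an integer, and the helper name `find_disjoint` is mine):

```python
from typing import List, Optional, Tuple

def find_disjoint(S: List[int], T: List[int], bit: int, k: int) -> Optional[Tuple[int, int]]:
    """Return a non-overlapping pair (s, t) with s in S, t in T, or None.

    Bitvectors are k-bit integers; `bit` is the next position to split on.
    """
    if not S or not T:
        return None
    # Base case: all positions used, or one set is down to a single
    # vector -- scan the remaining pairs directly in linear time.
    if bit == k or len(S) == 1 or len(T) == 1:
        for s in S:
            for t in T:
                if s & t == 0:
                    return (s, t)
        return None
    mask = 1 << bit
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]
    # Any pair from S1 x T1 shares bit `bit`, so only three of the
    # four combinations can contain a non-overlapping pair.
    return (find_disjoint(S0, T0, bit + 1, k)
            or find_disjoint(S0, T1, bit + 1, k)
            or find_disjoint(S1, T0, bit + 1, k))
```

For example, `find_disjoint([0b1100, 0b1010], [0b0011, 0b1111], 0, 4)` returns a pair whose bitwise AND is zero, while two identical one-bit vectors yield `None`.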
Since all bitvectors are chosen at random, we can expect $|S_b| \approx |S|/2$ and $|T_b| \approx |T|/2$. Thus we make three recursive calls, and we've reduced the size of the problem by a factor of two (both sets shrink by a factor of two). After $\lg \min(|S|,|T|)$ splits, one of the two sets is down to size $1$, and the problem can be solved in linear time. We get a recurrence along the lines of $T(n) = 3T(n/2) + O(nk)$, whose solution is $T(n) = O(n^{\lg 3} k) = O(n^{1.6} k)$. Accounting for running time more precisely in the two-set case, the running time is $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|)\, k)$.

This can be further improved by noting that once $k$ is a large enough multiple of $\lg n$, the probability that a non-overlapping pair exists is exponentially small. In particular, if $x,y$ are two random vectors, the probability that they're non-overlapping is $(3/4)^k$. If $|S|=|T|=n$, there are $n^2$ such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most $n^2 (3/4)^k$. When $k \ge 5 \lg n + 250$, this is $\le 1/2^{100}$ (solving $n^2 (3/4)^k \le 2^{-100}$ exactly requires $k \ge (2\lg n + 100)/\lg(4/3) \approx 4.82\lg n + 241$). So, as a pre-processing step, if $k \ge 5 \lg n + 250$, we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small); otherwise we run the above algorithm.

Thus we achieve a running time of $O(n^{1.6} \min(k, \lg n))$ (or $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \min(k, \lg n))$ for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random. Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case, but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.
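The union bound can be sanity-checked numerically. This sketch evaluates $\log_2$ of $n^2 (3/4)^k$ and confirms that $k \ge 5\lg n + 250$ (a constant chosen so the bound provably drops below $2^{-100}$) suffices:

```python
import math

def union_bound_log2(n: int, k: int) -> float:
    """log2 of n^2 * (3/4)^k, the union bound on the probability
    that n random k-bit vectors contain a non-overlapping pair."""
    return 2 * math.log2(n) + k * math.log2(3 / 4)

# For k >= 5 lg n + 250, the failure probability is below 2^-100.
n = 10**6
k = math.ceil(5 * math.log2(n) + 250)
assert union_bound_log2(n, k) < -100
```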
Lessons from the warmup
We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many $1$'s as possible: the more $0$'s there are in that position, the less the subproblems shrink. Third, this suggests that the problem gets harder as the density of $1$'s decreases. If there are very few $1$'s among the bitvectors (they are mostly $0$'s), the problem looks quite hard, as each split reduces the size of the subproblems only a little. So, define the density $\Delta$ to be the fraction of bits that are $1$ (i.e., out of all $nk$ bits), and the density $\Delta(i)$ of bit position $i$ to be the fraction of bitvectors that have a $1$ at position $i$.
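These densities are cheap to compute; a minimal sketch (integer-encoded bitvectors, hypothetical helper name):

```python
from typing import List

def densities(vectors: List[int], k: int) -> List[float]:
    """Per-position densities: the fraction of vectors with a 1 at
    each bit position.  Overall density is the mean of these values."""
    n = len(vectors)
    return [sum((v >> i) & 1 for v in vectors) / n for i in range(k)]

# Example: three 4-bit vectors; position 1 is set in all of them.
d = densities([0b1010, 0b0010, 0b0111], 4)
assert d[1] == 1.0
assert abs(sum(d) / 4 - 0.5) < 1e-9   # overall density: 6 ones / 12 bits
```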
Handling very low density
As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than $1/\sqrt{k}$, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) counting argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Say that a pair of bitvectors $x,y$ is covered by bit position $i$ if $x_i = y_i = 1$. Note that every pair of overlapping bitvectors must be covered by some bit position. Now fix a particular bit position $i$: the number of pairs that can be covered by that position is at most $(n \Delta(i))^2 < n^2/k$. Summing across all $k$ bit positions, the total number of pairs covered by some bit position is $< n^2$. Since there are $n^2$ pairs in all, there must exist some pair that's not covered by any bit position, and that pair is non-overlapping. So if the density is sufficiently low in every bit position, a non-overlapping pair surely exists.

However, I'm at a loss to identify a fast algorithm that finds such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any technique that would yield a running time with a sub-quadratic dependence on $n$. So this is a nice special case to focus on, if you want to spend some time thinking about this problem.
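The counting argument can be exercised empirically. The sketch below builds a random instance in which every column's density stays below $1/\sqrt{k}$, and a brute-force scan then always finds a disjoint pair (this demonstrates existence only; the open question is finding the pair in sub-quadratic time):

```python
import random

random.seed(0)
n, k = 30, 16                    # 1/sqrt(k) = 0.25, so at most 7 ones per column
vectors = [0] * n
for pos in range(k):
    # Place exactly 7 ones in this column: density 7/30 < 1/4.
    for row in random.sample(range(n), 7):
        vectors[row] |= 1 << pos

# Brute force: covered pairs <= k * 7^2 = 784 < n^2 = 900, so an
# uncovered (hence disjoint) pair is guaranteed by the argument above.
found = any(vectors[i] & vectors[j] == 0
            for i in range(n) for j in range(i + 1, n))
assert found
```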
Towards a general-case algorithm
In the general case, a natural heuristic seems to be: pick the bit position $i$ with the largest number of $1$'s (i.e., with the highest density $\Delta(i)$), and split on it. In other words:
- Find a bit position $i$ that maximizes $\Delta(i)$.
- Split $S$ and $T$ based upon bit position $i$. In other words, form $S_0 = \{s \in S : s_i = 0\}$, $S_1 = \{s \in S : s_i = 1\}$, $T_0 = \{t \in T : t_i = 0\}$, $T_1 = \{t \in T : t_i = 1\}$.
- Now recursively look for a non-overlapping pair from $S_0,T_0$, from $S_0,T_1$, and from $S_1,T_0$. (Any pair from $S_1,T_1$ overlaps at position $i$.) If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
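A hypothetical Python rendering of this heuristic (the names are mine; for simplicity densities are recomputed at every call rather than maintained incrementally as the text suggests):

```python
from typing import List, Optional, Tuple

def find_disjoint_dense_split(S: List[int], T: List[int],
                              free: List[int]) -> Optional[Tuple[int, int]]:
    """Search for a non-overlapping pair, always splitting on the
    densest not-yet-used bit position.  `free` lists the positions
    not yet split on; bitvectors are encoded as integers."""
    if not S or not T:
        return None
    if not free or len(S) == 1 or len(T) == 1:
        for s in S:
            for t in T:
                if s & t == 0:
                    return (s, t)
        return None
    # Pick the remaining position with the most 1's across S and T.
    i = max(free, key=lambda p: sum((v >> p) & 1 for v in S + T))
    rest = [p for p in free if p != i]
    mask = 1 << i
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]
    # Pairs from S1 x T1 share bit i, so three recursive calls suffice.
    return (find_disjoint_dense_split(S0, T0, rest)
            or find_disjoint_dense_split(S0, T1, rest)
            or find_disjoint_dense_split(S1, T0, rest))
```

For instance, `find_disjoint_dense_split([0b011, 0b100], [0b100, 0b011], [0, 1, 2])` returns a pair whose bitwise AND is zero.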
The challenge is to analyze its performance in the worst case. Let's assume that, as a pre-processing step, we first compute the density of every bit position. Also, if $\Delta(i) < 1/\sqrt{k}$ for every $i$, assume the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of such a pair, but let's set that aside as a separate challenge). All this can be done in $O(nk)$ time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Since the pre-processing step didn't fire, the chosen position has density $\Delta(i) \ge 1/\sqrt{k}$, so each level of recursion reduces the problem size by about $n/\sqrt{k}$ bitvectors (e.g., from $n$ bitvectors to $n - n/\sqrt{k}$ bitvectors). Therefore, the recursion can only go about $\sqrt{k}$ levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are far fewer than $3^{\sqrt{k}}$ leaves), so I'm not sure what running time this should lead to.
Best Answer from Stack Exchange
Question Source : http://cs.stackexchange.com/questions/43864