Given two (finite) sets $A$ and $B$ of strings over some alphabet $Sigma$, is there a regular expression of length $leq k$ that accepts every string in $A$ and rejects every string in $B$?
Is anything known about the complexity of this particular separation problem? (Note that since I’ve specified $A$ and $B$ as finite sets of strings, the natural notion of size for the problem is the total lengths of all strings in $A$ and $B$; this swamps any contribution from $k$). It seems highly likely to me that it is NP-complete (and in fact, I would expect the reduction to be to some sort of cover problem) but a few searches haven’t turned up anything particularly useful.
Asked By : Steven Stadnicki
Answered By : FrankW
- letters from $Sigma$, matching themselves,
- $+$, denoting union,
- $cdot$, denoting concatenation,
- $*$, denoting Kleene-Star,
- $lambda$, matching the empty string
and nothing else. Length of a regex is defined as the number of characters from $Sigma$. As in the comic strip, we consider a regex to match a word, if it matches a substring of the word. (Changing any of these assumptions should only influence the complexity of the construction below, but not the general result.) That it is in NP is straightforward, as explained in the comments (verify a candidate-RE by translating it into an NFA and running that on all words from $A$ and $B$). In order to show NP-hardness, we reduce Set cover:
Given a universe $U$ and a collection $C$ of subsets of $U$, is there a set $C’ subseteq C$ of size $leq k$ so that $bigcup_{S in C’} S = U$?
We translate an input for Set cover into one for regex golf as follows:
- $Sigma$ contains one character for each subset in $C$ and one additional character (denoted $x$ in the following).
- $A$ contains one word for each element $e$ of $U$. The word consists of exactly the characters representing subsets in $C$ that contain $e$ (in arbitrary order).
- $B$ contains the single word $x$.
- $k$ is simply carried over.
This reduction is obviously in P and equivalence is also quite simple to see:
- If $c_1, ldots, c_k$ is a solution for the set cover instance, the regex $c_1 + cdots + c_k$ is a solution to regex golf.
- A regex matching the empty subword would match $x$. Thus, any regex solving the golf problem has to contain at least one letter from each of the words in $A$. Hence, if the golf instance is solvable, there is a set of at most $k$ letters from $Sigma$ so that each word in $A$ is covered by this set of letters. By construction, the corresponding set of subsets from $C$ is a solution to the set cover instance.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/19686