How to simulate backreferences, lookaheads, and lookbehinds in finite state automata?

Problem Detail: I created a simple regular expression lexer and parser to take a regular expression and generate its parse tree. Creating a non-deterministic finite state automaton from this parse tree is relatively simple for basic regular expressions. However I can’t seem to wrap my head around how to simulate backreferences, lookaheads, and lookbehinds. From what I read in the purple dragon book I understood that to simulate a lookahead $r/s$ where the regular expression $r$ is matched if and only if the match is followed by a match of the regular expression $s$, you create a non-deterministic finite state automaton in which $/$ is replaced by $varepsilon$. Is it possible to create a deterministic finite state automaton that does the same? What about simulating negative lookaheads and lookbehinds? I would really appreciate it if you would link me to a resource which describes how to do this in detail.

Asked By : Aadit M Shah

Answered By : Raphael

First of all, backreferences can not be simulated by finite automata as they allow you to describe non-regular languages. For example, ([ab]^*)1 matches ${ww mid w in {a,b}^*}$, which is not even context-free. Look-ahead and look-behind are nothing special in the world of finite automata as we only match whole inputs here. Therefore, the special semantic of “just check but don’t consume” is meaningless; you just concatenate and/or intersect checking and consuming expressions and use the resulting automata. The idea is to check the look-ahead or look-behind expressions while you “consume” the input and store the result in a state. When implementing regexps, you want to run the input through an automaton and get back start and end indices of matches. That is a very different task, so there is not really a construction for finite automata. You build your automaton as if the look-ahead or look-behind expression were consuming, and change your index storing resp. reporting accordingly. Take, for instance, look-behinds. We can mimic the regexp semantics by executing the checking regexp concurrently to the implicitly consuming “match-all” regexp. only from states where the look-behind expression’s automaton is in a final state can the automaton of the guarded expression be entered. For example, the regexp /(?=c)[ab]+/ (assuming ${a,b,c}$ is the full alphabet) — note that it translates to the regular expression ${a,b,c}^*c{a,b}^+{a,b,c}^*$ — could be matched by enter image description here

[source] and you would have to

store the current index as $i$ whenever you enter $q_2$ (initially or from $q_2$) and
report a (maximum) match from $i$ to the current index ($-1$) whenever you hit (leave) $q_2$.

Note how the left part of the automaton is the parallel automaton of the canonical automata for [abc]* and c (iterated), respectively. Look-aheads can be dealt with similarly; you have to remember the index $i$ when you enter the “main” automaton, the index $j$ when you leave the main automaton and enter the look-ahead automaton and report a match from $i$ to $j$ only when you hit the look-ahead automaton’s final state. Note that non-determinism is inherent to this: main and look-ahead/-behind automaton might overlap, so you have to store all transitions between them in order to report the matching ones later, or backtrack.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/2557

Asked By : Aadit M Shah

Answered By : Raphael

Best Answer from StackOverflow

Related