Asked By : Stefano Sanfilippo
Answered By : babou
Each time we perform the completer operation adding a state $E \rightarrow \alpha D.\beta ; g$ (ignoring lookahead) we construct a pointer from the instance of $D$ in that state to the state $D \rightarrow \gamma. ; f$ which caused us to do the operation. This indicates that $D$ was parsed as $\gamma$. In case $D$ is ambiguous there will be a set of pointers from it, one for each completer operation which caused $E \rightarrow \alpha D.\beta ; g$ to be added to the particular state set. Each symbol in $\gamma$ will also have pointers from it (unless it is terminal), and so on, thus representing the derivation tree for $D$.
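As a concrete illustration, here is a minimal sketch (my own illustrative code, not from Earley's paper) of the mechanism the quote describes: the completer advances a waiting state over a completed nonterminal and records a pointer to the completed state; if $D$ is ambiguous, repeated completions accumulate several pointers at the same position, and following one pointer per position later reads off one derivation tree.

```python
# Illustrative sketch of completion pointers, using a simplified state
# representation (the names here are mine, not Earley's).
from dataclasses import dataclass, field

@dataclass
class State:
    lhs: str                 # rule left-hand side, e.g. "E"
    rhs: tuple               # rule right-hand side, e.g. ("a", "D", "b")
    dot: int                 # dot position within rhs
    origin: int              # input index where recognition of this rule began
    pointers: dict = field(default_factory=dict)  # rhs position -> completed States

def complete(waiting, completed):
    """Completer step: `completed` is D -> gamma. ; f and `waiting` is
    E -> alpha . D beta ; g.  Advance the dot over D and record a pointer
    from that instance of D to the completed state."""
    assert waiting.rhs[waiting.dot] == completed.lhs
    advanced = State(waiting.lhs, waiting.rhs, waiting.dot + 1, waiting.origin,
                     {k: list(v) for k, v in waiting.pointers.items()})
    # An ambiguous D causes several completions, hence several pointers here.
    advanced.pointers.setdefault(waiting.dot, []).append(completed)
    return advanced

def one_tree(state):
    """Read one derivation tree (as nested tuples) off the pointers, taking
    the first pointer at each position; taking all of them instead would
    enumerate the shared forest."""
    children = [one_tree(state.pointers[k][0]) if k in state.pointers else sym
                for k, sym in enumerate(state.rhs)]
    return (state.lhs, tuple(children))
```

For example, completing $E \rightarrow a.Db$ with $D \rightarrow x.$ yields $E \rightarrow aD.b$ carrying a pointer to the completed $D$ state, and `one_tree` then reproduces the derivation tree described in the quote.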
Note that in this text, $f$ and $g$ are indices into the parsed string, pointing to where the recognition of the rule's left-hand side started (i.e., where the right-hand side was predicted). So $f$ is the string index where the recognition of $D \rightarrow \gamma$ started, and it ended at index $g$. These “completion pointers” are the Earley equivalent of the back-pointers described (not too well in Wikipedia) for the parser version of CYK.

From such a pointer (as described in the quote) we know that the $D$ in the rule instance $E \rightarrow \alpha D.\beta ; g$ can itself be developed into a tree (or forest) that parses the input string $w$ from index $f+1$ to index $g$, which we write $w_{f+1:g}$. The nodes immediately below $D$ are given by the rule $D \rightarrow \gamma$. By looking for the completion that led to $D \rightarrow \gamma. ; f$ we can then find other such pointers that tell how the last symbol of $\gamma$ was obtained, and hence more information on the possible parse trees. Likewise, by looking at the completion that recognized the symbol before last, in an earlier state set, you find how it was obtained, and so on. Assuming you kept all the needed pointers as indicated in the paper, you can get all the shared tree representations starting from the last symbol recognized by the parser, which is of course the initial symbol of the grammar.

But I skipped the messy part. Suppose you have a rule $U \rightarrow XYZ$, which I chose with a right-hand side longer than 2 symbols, and another rule $W \rightarrow UV$, in an ambiguous grammar. It may well happen that the parser parses $w_{f+1:g}$ into $X$, $w_{g+1:h}$ into $Y$, and both $w_{h+1:i}$ and $w_{h+1:j}$ into $Z$. So, with the rule $U \rightarrow XYZ$, both $w_{f+1:i}$ and $w_{f+1:j}$ parse into $U$. Then it may also be that both $w_{i+1:k}$ and $w_{j+1:k}$ parse into $V$.
Then, with the rule $W \rightarrow UV$, the string $w_{f+1:k}$ parses into $W$ in two different ways, which corresponds to an ambiguity of the grammar.

Of course, to avoid repeating computations, Earley’s algorithm will attempt to share as much as possible of the two parsing computations. What it will obviously share is the recognition (and parsing) of $w_{f+1:g}$ and $w_{g+1:h}$ into $X$ and $Y$. But it will actually do a bit more: it will also share the beginning of the two distinct parses that recognize $U$ with the rule $U \rightarrow XYZ$. What I mean is that the state $U \rightarrow XY.Z ; f$ will be found only once (with respect to what I am describing), in state set $S_h$. It will be a common part of the two parses. Of course, things will temporarily diverge while parsing the $Z$, since the two parses correspond to distinct substrings, until they converge again when everything parses into $W$, when the state $W \rightarrow UV. ; f$ is produced twice in state set $S_k$.

So the forest of syntax trees can be a very strange one, with Siamese-twin subtrees that may share the first two edges of some node, but not the third edge. In other words, it may be a very awkward structure. This may explain why Earley calls it “a factored representation of all possible parse trees”, without being more specific. Any attempt to surgically separate the Siamese twins without changing the grammar will result in increased complexity. The right way to do it is to binarize the grammar.

I hope this will help you. Let me know. But I do insist that a good understanding of CYK parsing can help. There are other algorithms, simpler than Earley’s, that can parse all CF languages efficiently. You may find more general information on this parse-forest issue in two other answers I gave: http://cstheory.stackexchange.com/questions/7374#18006 and http://linguistics.stackexchange.com/questions/4619#6120. But they do not go into the specific details of Earley’s algorithm.
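The binarization recommended above can be sketched as follows (a minimal illustrative helper of my own; the fresh nonterminal names `_B1`, `_B2`, … are invented). Left-binarizing splits $U \rightarrow XYZ$ into $\_B1 \rightarrow XY$ and $U \rightarrow \_B1\,Z$, so the shared prefix “$X$ and $Y$ already recognized” (the state $U \rightarrow XY.Z ; f$) becomes an ordinary node that the forest can share cleanly instead of a Siamese-twin situation.

```python
# Illustrative left-binarization of a grammar given as (lhs, rhs_tuple) pairs.
# Fresh nonterminals _B1, _B2, ... are invented names for this sketch.
def binarize(rules):
    """Return an equivalent rule list in which no right-hand side is longer
    than two symbols, introducing fresh nonterminals for rule prefixes."""
    out, count = [], 0
    for lhs, rhs in rules:
        rest = list(rhs)
        while len(rest) > 2:
            count += 1
            fresh = f"_B{count}"
            out.append((fresh, (rest[0], rest[1])))  # _Bk -> first two symbols
            rest = [fresh] + rest[2:]                # treat _Bk as one symbol
        out.append((lhs, tuple(rest)))
    return out
```

On the example above, `binarize([("U", ("X", "Y", "Z")), ("W", ("U", "V"))])` leaves $W \rightarrow UV$ alone and splits $U \rightarrow XYZ$ into the two binary rules just described.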
Best Answer from Computer Science Stack Exchange
Question Source : http://cs.stackexchange.com/questions/27937