Asked By : Stefano Sanfilippo
Answered By : babou
Each time we perform the completer operation adding a state $E \rightarrow \alpha D.\beta ; g$ (ignoring lookahead) we construct a pointer from the instance of $D$ in that state to the state $D \rightarrow \gamma. ; f$ which caused us to do the operation. This indicates that $D$ was parsed as $\gamma$. In case $D$ is ambiguous there will be a set of pointers from it, one for each completer operation which caused $E \rightarrow \alpha D.\beta ; g$ to be added to the particular state set. Each symbol in $\gamma$ will also have pointers from it (unless it is terminal), and so on, thus representing the derivation tree for $D$.
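As a concrete illustration, here is a minimal sketch (my own illustrative code, not from Earley's paper) of the mechanism the quote describes: the completer advances a waiting state over a completed nonterminal and records a pointer to the completed state; if $D$ is ambiguous, repeated completions accumulate several pointers at the same position, and following one pointer per position later reads off one derivation tree.

```python
# Illustrative sketch of completion pointers, using a simplified state
# representation (the names here are mine, not Earley's).
from dataclasses import dataclass, field

@dataclass
class State:
    lhs: str                 # rule left-hand side, e.g. "E"
    rhs: tuple               # rule right-hand side, e.g. ("a", "D", "b")
    dot: int                 # dot position within rhs
    origin: int              # input index where recognition of this rule began
    pointers: dict = field(default_factory=dict)  # rhs position -> completed States

def complete(waiting, completed):
    """Completer step: `completed` is D -> gamma. ; f and `waiting` is
    E -> alpha . D beta ; g.  Advance the dot over D and record a pointer
    from that instance of D to the completed state."""
    assert waiting.rhs[waiting.dot] == completed.lhs
    advanced = State(waiting.lhs, waiting.rhs, waiting.dot + 1, waiting.origin,
                     {k: list(v) for k, v in waiting.pointers.items()})
    # An ambiguous D causes several completions, hence several pointers here.
    advanced.pointers.setdefault(waiting.dot, []).append(completed)
    return advanced

def one_tree(state):
    """Read one derivation tree (as nested tuples) off the pointers, taking
    the first pointer at each position; taking all of them instead would
    enumerate the shared forest."""
    children = [one_tree(state.pointers[k][0]) if k in state.pointers else sym
                for k, sym in enumerate(state.rhs)]
    return (state.lhs, tuple(children))
```

For example, completing $E \rightarrow a.Db$ with $D \rightarrow x.$ yields $E \rightarrow aD.b$ carrying a pointer to the completed $D$ state, and `one_tree` then reproduces the derivation tree described in the quote.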
Note that in this text, $f$ and $g$ are indices into the parsed string, pointing to where the recognition of the rule's left-hand side started (i.e., where the right-hand side was predicted). So $f$ is the string index where the recognition of $D \rightarrow \gamma$ started, and it ended at index $g$. These “completion pointers” are the Earley equivalent of the back-pointers described (not too well in Wikipedia) for the parser version of CYK.

From such a pointer (as described in the quote) we know that the $D$ in the rule instance $E \rightarrow \alpha D.\beta ; g$ can itself be developed into a tree (or forest) that parses the input string $w$ from index $f+1$ to index $g$, which we write $w_{f+1:g}$. The nodes immediately below $D$ are given by the rule $D \rightarrow \gamma$. By looking for the completion that led to $D \rightarrow \gamma. ; f$ we can then find other such pointers that tell how the last symbol of $\gamma$ was obtained, and hence more information on the possible parse trees. Likewise, by looking at the completion that recognized the symbol before last, in an earlier state set, you find how it was obtained, and so on. Assuming you kept all the needed pointers as indicated in the paper, you can get all the shared tree representations starting from the last symbol recognized by the parser, which is of course the initial symbol of the grammar.

But I skipped the messy part. Suppose you have a rule $U \rightarrow XYZ$, which I chose with a right-hand side longer than 2 symbols, and another rule $W \rightarrow UV$, in an ambiguous grammar. It may well happen that the parser parses $w_{f+1:g}$ into $X$, $w_{g+1:h}$ into $Y$, and both $w_{h+1:i}$ and $w_{h+1:j}$ into $Z$. So, with the rule $U \rightarrow XYZ$, both $w_{f+1:i}$ and $w_{f+1:j}$ parse into $U$. Then it may also be that both $w_{i+1:k}$ and $w_{j+1:k}$ parse into $V$.
Then, with the rule $W \rightarrow UV$, the string $w_{f+1:k}$ parses into $W$ in two different ways, which corresponds to an ambiguity of the grammar.

Of course, to avoid repeating computations, Earley’s algorithm will attempt to share as much as possible of the two parsing computations. What it will obviously share is the recognition (and parsing) of $w_{f+1:g}$ and $w_{g+1:h}$ into $X$ and $Y$. But it will actually do a bit more: it will also share the beginning of the two distinct parses that recognize $U$ with the rule $U \rightarrow XYZ$. What I mean is that the state $U \rightarrow XY.Z ; f$ will be found only once (with respect to what I am describing), in state set $S_h$. It will be a common part of the two parses. Of course, things will temporarily diverge while parsing the $Z$, since the two parses correspond to distinct substrings, until they converge again when everything parses into $W$, when the state $W \rightarrow UV. ; f$ is produced twice in state set $S_k$.

So the forest of syntax trees can be a very strange one, with Siamese-twin subtrees that may share the first two edges of some node, but not the third edge. In other words, it may be a very awkward structure. This may explain why Earley calls it “a factored representation of all possible parse trees”, without being more specific. Any attempt to surgically separate the Siamese twins without changing the grammar will result in increased complexity. The right way to do it is to binarize the grammar.

I hope this will help you. Let me know. But I do insist that a good understanding of CYK parsing can help. There are other algorithms, simpler than Earley’s, that can parse all CF languages efficiently. You may find more general information on this parse-forest issue in two other answers I gave: http://cstheory.stackexchange.com/questions/7374#18006 and http://linguistics.stackexchange.com/questions/4619#6120. But they do not go into the specific details of Earley’s algorithm.
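The binarization recommended above can be sketched as follows (a minimal illustrative helper of my own; the fresh nonterminal names `_B1`, `_B2`, … are invented). Left-binarizing splits $U \rightarrow XYZ$ into $\_B1 \rightarrow XY$ and $U \rightarrow \_B1\,Z$, so the shared prefix “$X$ and $Y$ already recognized” (the state $U \rightarrow XY.Z ; f$) becomes an ordinary node that the forest can share cleanly instead of a Siamese-twin situation.

```python
# Illustrative left-binarization of a grammar given as (lhs, rhs_tuple) pairs.
# Fresh nonterminals _B1, _B2, ... are invented names for this sketch.
def binarize(rules):
    """Return an equivalent rule list in which no right-hand side is longer
    than two symbols, introducing fresh nonterminals for rule prefixes."""
    out, count = [], 0
    for lhs, rhs in rules:
        rest = list(rhs)
        while len(rest) > 2:
            count += 1
            fresh = f"_B{count}"
            out.append((fresh, (rest[0], rest[1])))  # _Bk -> first two symbols
            rest = [fresh] + rest[2:]                # treat _Bk as one symbol
        out.append((lhs, tuple(rest)))
    return out
```

On the example above, `binarize([("U", ("X", "Y", "Z")), ("W", ("U", "V"))])` leaves $W \rightarrow UV$ alone and splits $U \rightarrow XYZ$ into the two binary rules just described.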
Best Answer from Computer Science Stack Exchange
Question Source : http://cs.stackexchange.com/questions/27937