[Solved]: What techniques can I use to hand-write a parser for an ambiguous grammar?

Problem Detail: I’m writing a compiler, and I’ve built a recursive-descent parser to handle the syntax analysis. I’d like to enhance the type system to support functions as a valid variable type, but I’m building a statically typed language, and my desired syntax for a function type renders the grammar temporarily* ambiguous until resolved. I’d rather not use a parser generator, though I know that Elkhound would be an option. I know I can alter the grammar to make it parse within a fixed number of steps, but I’m more interested in how to implement this by hand. I’ve made a number of attempts at figuring out the high-level control flow, but every time I do this I end up forgetting a dimension of complexity, and my parser becomes impossible to use and maintain. There are two layers of ambiguity: a statement can be an expression, a variable definition, or a function declaration, and the function declaration can have a complex return type. Grammar subset, demonstrating the ambiguity:

basetype   : TYPE   | IDENTIFIER   ;  type   : basetype   | basetype parameter_list   ;  arguments   : arraytype IDENTIFIER   | arraytype IDENTIFIER COMMA arguments   ;  argument_list   : OPEN_PAREN CLOSE_PAREN   | OPEN_PAREN arguments CLOSE_PAREN   ;  parameters   : arraytype   | arraytype COMMA parameters   ;  parameter_list   : OPEN_PAREN CLOSE_PAREN   | OPEN_PAREN parameters CLOSE_PAREN   ;  expressions   : expression   | expression COMMA expressions   ;  expression_list   : OPEN_PAREN CLOSE_PAREN   | OPEN_PAREN expressions CLOSE_PAREN   ;  // just a type that can be an array (this language does not support // multidimensional arrays) arraytype:   : type   | type OPEN_BRACKET CLOSE_BRACKET   ;  block   : OPEN_BRACE CLOSE_BRACE   | OPEN_BRACE statements CLOSE_BRACE   ;  function_expression   : arraytype argument_list block   | arraytype IDENTIFIER argument_list block   ;  call_expression   : expression expressions   ;  index_expression   : expression OPEN_BRACKET expression CLOSE_BRACKET   ;  expression   : function_expression   | call_expression   | index_expression   | OPEN_PAREN expression CLOSE_PAREN   ;  function_statement   : arraytype IDENTIFIER argument_list block   ;  define_clause   : IDENTIFIER   | IDENTIFIER ASSIGN expression   ;  define_chain   : define_clause   | define_clause COMMA define_chain   ;  define_statement   : arraytype define_chain   ;  statement   : function_statement   | define_statement SEMICOLON   | expression SEMICOLON   ;  statements   :   | statement statements   ; 

Example parses:

// function 'fn' returns a reference to a void, parameterless function void() fn() {} // the parser doesn't know which of these are types and which are variables, // so it doesn't know until the end that this is a call_expression Object(var, var, var, var) // the parser only finds out at the end that this is a function declaration Object(var, var, var, var) fn2() {} (Object(var, var, var, var) ()) (Object(var, var, var, var) () {}) // the parser could possibly detect the "string" type and figure out that // this has to be a define statement or a function declaration, but it's // still ambiguous to the end Object(Object(string, Object), string[]) fn3() {} Object(Object(string, Object), string[]) fn4 = fn3; 

My basic approach has been to write functions that could parse the unambiguous components of this subset of the grammar, and then flatten the more complex control flow into individual functional blocks to capture state in function calls. This has proved unsuccessful, what techniques can one use to solve this kind of problem? *There is likely a better word for this

Asked By : skeggse

Answered By : babou

This answer is based on the additional information you give that your problem might be more non-determinism than ambiguity, and that you think you could then solve it if you had some lookahead arbitrarily far away. Since this case allows for quite different answer, I think it is clearer to answer it separately from my previous answer. If unbounded lookahead is all you need, then you should investigate whether RLR parsing could do the job for you. The idea of RLR is that you use a finite state automaton to get information that can be arbitrarily far away. I am not sure about all the details of the technique in its general form, but it has been implemented in some freely available parser generators. Given that you insist on a hand written parser, you might be able to apply that by hand to your own grammar. The best way to explain it is on a well known example, because many languages actually need unbounded look ahead, but deal with the problem under the rug, so that you do not notice it. The usual name for the rug is lexical analysis. Suppose the grammar includes the following rules:

U -> X Y V -> X Y id M -> U id foo N -> V bar  Z -> M | N 

where id is any identifier, while foo and bar are language keywords. and you have to parse … xxxx yyy very_long_identifier foo …. After reducing xxxx to X and yyy to Y, you have to decide whether to reduce X Y to U, or rather scan the coming identifier. It is quite visible that that depend on the keyword foo or bar that comes after the identifier. This can be easily dealt with with bounded lookahead, as the decisive keyword is only 2 tokens away when the decision is to be made. But this assumes that the identifier is considered as a single keyword. Some people assert that, since CF languages are closed under substitution by regular sets, lexical analysis is not really essential and CF parsers should be able to handle it with the rest of the CF syntax. This is true for a general CF parser (though disputable on practical grounds). But it requires some care and adaptation for classical deterministic parsing technology, because the theoretically unbounded length of identifiers raises lookahead problems. In the example above, if the identifier has not been recognized as such by a lexical analyzer, then it is just an arbitrarily long sequence of alphanumeric tokens, so that the decisive keywords, foo and bar are unboundedly far away. This shows that recognizing an intermediate structure through a pass with a simple finite state device can remove a problem such as unbounded lookahead, at least in some cases, which is all you ask for if you develop by hand a specific parser. If is for you to analyze precisely your grammar to see what can be done, but the above suggest the following approach: You state:

I cannot know definitively whether a given string of tokens represents a function statement, expression, or define statement until near the end of any of those structures in the worst case.

So the question is (up to some probably modifiable details): can you recognize that a sequence of symbols is one of these two constructs of your language, and identify which (without producing a parse-tree) by means of a simple finite state automaton (FSA)? If the answer is yes, than whenever you encounter the beginning of such a sequence, you activate the corresponding FSA to determine quickly which kind of sequence it is, and then you use the result to make your parsing decision.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/44102 3.2K people like this

 Download Related Notes/Documents

Leave a Reply