Asked By : Chirac
Answered By : babou
- it is very convenient for associating a precise semantic meaning with syntactic constructs, in a compositional way, by means of a simple mathematical concept: the homomorphism;
- it provides and enforces a syntactic type system (statement, expression, …) that preserves the existence of some meaning when the program representation is transformed in a way that respects the syntactic typing constraints;
- it is easily manipulated to actually perform analyses and transformations;
- it is easy to decorate with localized information to help or guide these analyses and transformations.
First, if you take a string representation of a program, without further structural information, there is no simple way to delimit substrings that naturally have a meaning of their own (I say naturally, because some people may always try to play at abstracting syntactic context or indulge in syntactic, text-oriented continuation games – ignore this if you do not get it).

The idea of a tree structure, with nodes labeled with operators, possibly restricted by syntactic types (statement, expression, variable, …), is that it naturally decomposes the syntactic structure into subparts (the subtrees) that can fairly naturally, and perspicuously with respect to computational concepts, have a semantic meaning of their own. Then, if we associate a semantic function with each operator, we can get the meaning of a tree representation by applying the semantic function of the root operator to the semantic meanings of its subtrees. If we also provide some semantics for leaf operators (which may be defined as a mapping from syntactic representations to some semantic domains, possibly by other means, for example associating the integer 23 with the string “23”), we know how to define simply the semantics of any well-formed tree, whether a complete program or a well-formed program fragment. This is the basis of what is known as denotational semantics.

In other words, the homomorphism thus defined gives a rather simple and tractable way to associate a meaning with well-formed program fragments, by composing the meanings of their well-formed sub-fragments (this is known as semantic compositionality). It then becomes much easier to attempt to define semantically meaningful transformations. And the semantics, i.e. what the program (fragment) does, is what we really care about.
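To make the homomorphism idea concrete, here is a minimal sketch in Python (the tuple encoding and operator names are illustrative assumptions, not from the answer): the meaning of a tree is built by applying the operator's semantic function to the meanings of its subtrees, and leaves get their own mapping to semantic values.

```python
# Denotational semantics as a homomorphism from ASTs to functions.
# An AST node is a tuple (operator, *subtrees); leaves are ("num", 23) or ("var", "x").

def meaning(tree):
    """Map an AST to its denotation: a function from environments to values."""
    op = tree[0]
    if op == "num":                      # leaf: a constant denotes itself
        n = tree[1]
        return lambda m: n
    if op == "var":                      # leaf: an identifier, looked up in the environment
        name = tree[1]
        return lambda m: m[name]
    # interior node: combine the meanings of the subtrees (compositionality)
    subs = [meaning(t) for t in tree[1:]]
    if op == "add":
        return lambda m: subs[0](m) + subs[1](m)
    if op == "mul":
        return lambda m: subs[0](m) * subs[1](m)
    raise ValueError("unknown operator: " + op)

# The meaning of (x + 2) * y, evaluated in the environment {x: 3, y: 5}:
expr = ("mul", ("add", ("var", "x"), ("num", 2)), ("var", "y"))
print(meaning(expr)({"x": 3, "y": 5}))   # 25
```

Note that any well-formed subtree, such as `("add", ("var", "x"), ("num", 2))`, gets a meaning of its own in exactly the same way, which is the point made above about program fragments.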
Furthermore, ASTs are fairly easy to manipulate by programs in order to express transformations that somewhat respect the semantic structure (as they manipulate subtrees that are semantically meaningful), which initially justified their use in program editors, program manipulation systems, and programming environments. Actually, the concept of the AST in programming appeared with the language Lisp (and its ability to manipulate Lisp programs syntactically) in the very late fifties. But it developed mostly in the seventies (Emily, the Cornell Synthesizer, Mentor/Centaur), and further in the eighties, at the same time as denotational semantics (which is based on the above approach), and most likely under its influence. The early work (LCF) on automated proofs about computation may also have been influential.

From the point of view of genetic programming, the use of ASTs both facilitates and enforces the creation of syntactic structures that are more likely to have some meaning. Manipulating unstructured string representations would likely result in being swamped with string representations to which no meaning can be associated. Another advantage of AST representations is that they are fairly easy to decorate with additional information: precomputed semantic properties, weights, or whatever is deemed useful for the intended program transformation or manipulation, including genetically programmed transformations.
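As a small illustration of decorating an AST, here is a sketch (the attribute chosen, subtree size, and all names are assumptions for illustration) that annotates every node with a synthesized attribute of the kind a genetic-programming pass might use, e.g. to weight crossover points:

```python
# Decorate each AST node with its subtree size (number of nodes),
# a typical precomputed weight for guiding transformations.
# ASTs are tuples (operator, *subtrees); leaves are ("num", n) or ("var", name).

def decorate(tree):
    """Return (decorated_tree, size); each decorated node records its subtree size."""
    op = tree[0]
    if op in ("num", "var"):
        node = {"node": tree, "size": 1, "children": []}
        return node, 1
    children, total = [], 1                # 1 counts this operator node itself
    for sub in tree[1:]:
        d, s = decorate(sub)
        children.append(d)
        total += s
    return {"node": op, "size": total, "children": children}, total

expr = ("add", ("num", 1), ("mul", ("var", "x"), ("num", 2)))
decorated, size = decorate(expr)
print(size)   # 5: add, num 1, mul, var x, num 2
```

Because the decoration lives on the nodes, it travels with a subtree when the subtree is spliced elsewhere, and only the decorations along the path to the splice point need recomputation.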
A detailed example
This section has been added later to answer some comments. I will try to work out one example in some detail to explain the relation between manipulations of the AST and their semantic counterpart.

We take a very simple example of crossover between two ASTs: T1=foo(exp1,exp2) and T2=bar(exp3,exp4). I keep it small for readability. Actually, exp1 to exp4 are meant to be subtrees of some AST, while you may also see what is around them as standing for the rest of some AST (or some AST subtree) that is the context in which they occur. We suppose first that the language does not have typing (in the usual sense of programming languages) and accepts any value in any context expecting a value, so that any expression can legitimately replace any other. Then, given T1 and T2, it is legitimate to apply a crossover that swaps exp1 and exp4, thus producing T1’=foo(exp4,exp2) and T2’=bar(exp3,exp1). Note that the algorithm could also consider only one of these two trees (I am incompetent regarding genetic splicing strategies).

Now, let us see what can be said about semantics, without getting too much into details. Let $S$ be the semantic function, $V$ the domain of values, and $M$ the domain of environments that map identifiers to values. The semantics $S$(exp) of an expression exp is a function $e: M \to V$ that takes an environment $m$ as argument (so that we have values for identifiers) and returns the value of the expression in that environment. Now, if the operators foo and bar are supposed to compute respectively two functions $f$ and $b$ on their arguments, the semantics of T1, for example, will be defined as $S$(foo(exp1,exp2)) $=$ $S$(foo)($S$(exp1),$S$(exp2)) $= \lambda m.\, f(e_1(m), e_2(m))$. You will notice that the semantic function is not necessarily easy to define in the complex case of a programming language, since for foo we have here $S$(foo) $= \lambda e, e'.\ \lambda m.\ f(e(m), e'(m))$.
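The crossover step described above can be sketched directly on tuple-encoded trees (the encoding is an illustrative assumption; the names T1, T2, exp1–exp4 follow the text):

```python
# Crossover on ASTs T1 = foo(exp1, exp2) and T2 = bar(exp3, exp4):
# swap exp1 and exp4, producing T1' = foo(exp4, exp2) and T2' = bar(exp3, exp1).

def crossover(t1, t2):
    """Swap the first argument of t1 with the second argument of t2."""
    foo, e1, e2 = t1
    bar, e3, e4 = t2
    return (foo, e4, e2), (bar, e3, e1)

T1 = ("foo", "exp1", "exp2")       # the strings stand for arbitrary subtrees
T2 = ("bar", "exp3", "exp4")
print(crossover(T1, T2))   # (('foo', 'exp4', 'exp2'), ('bar', 'exp3', 'exp1'))
```

Because only whole subtrees are moved, both results are again well-formed ASTs, which is exactly what string splicing does not guarantee.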
This complexity of the operator semantics results from the choice of the abstract syntax, which here considers foo and bar as AST operator nodes. Instead, the AST could have a call operator with several daughters, for example as in call(foo,exp1,exp2). Then the complexity of dealing with the environment $m$ would be factorized in the semantics of call, while the semantics of foo would simply be the function $f$. Sorry for this complexity.

The whole point is that we can define the semantics of an AST subtree such as exp1 independently of any context, as a function $e_1$. Whatever information it needs from the context in order to be evaluated is summarized in the arguments (here the environment $m$) that are passed to its semantics. And this information is in turn passed to the semantic functions associated with subexpressions.

So now we know that a subtree of an AST can have a well-defined semantics, independently of its context. But what about the context? Does it have a well-defined semantics when a subtree is missing? The nice point is that it does. If you consider the context foo(??,exp2), what can its semantics be? The natural answer is that it is whatever semantics it would have if you provided the missing part. In other words, it is a functional semantics that takes as argument the semantics of the missing part. This is very similar to the semantics of the complete expression $S$(foo(exp1,exp2)) that we defined above, except that the missing exp1 must be accounted for by an argument $e$ standing for the semantics of whatever expression could replace exp1. We had $S$(foo(exp1,exp2)) $= \lambda m.\, f(e_1(m), e_2(m))$. So, without going into further details, you have $S$(foo(??,exp2)) $= \lambda e.\, \lambda m.\, f(e(m), e_2(m))$. Then, given that $S$(exp1) $= e_1: M \to V$, we have
$S$(foo(exp1,exp2)) $=$ $S$(foo(??,exp2))($S$(exp1)), which is precisely what we would like: syntactic abstraction translates directly into semantic abstraction. Then we also have, of course: $S$(foo(exp4,exp2)) $=$ $S$(foo(??,exp2))($S$(exp4)). In other words, both AST subtrees and AST contexts (of subtrees) have well-defined semantics that do not change, and they compose to give the semantics of the whole AST when a subtree is placed in a permitted context. But of course: $S$(foo(exp1,exp2)) $\neq$ $S$(foo(exp4,exp2)). I hope this explains how things work. Using well-formed ASTs will ensure that you are actually manipulating semantic fragments that compose meaningfully.

Actually, things may be a bit more complicated, because you want to include in the process what is usually called static semantics, i.e. constraints on the AST that are usually checked at compile time. The best-known examples are type constraints for statically typed languages, or declarations of identifiers when they are required. Thus you may want to decorate ASTs with this statically computable information, and use splicing techniques that incrementally preserve the information and check the required constraints on the AST, so that you produce only programs that at least compile.

Then, as a last comment, I would like to remark that linear genetic programming seems to be doing pretty much the same, but restricts splicing to sequences of statements, so that there is always some executable semantics preserved. But I am not a specialist in that.
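These identities can be checked on a concrete instance. In the sketch below, the choice of $f$, the subexpression meanings, and the test environment are all illustrative assumptions; the point is only that the context's meaning applied to a subtree's meaning recovers the whole tree's meaning, while a different subtree generally gives a different meaning.

```python
# Checking: S(foo(exp1, exp2)) = S(foo(??, exp2))(S(exp1)),
# and that substituting exp4 for exp1 changes the overall meaning.

f  = lambda x, y: x * y        # the function computed by the operator foo (assumed)
e1 = lambda m: m["x"]          # S(exp1): reads x from the environment
e2 = lambda m: m["y"]          # S(exp2): reads y from the environment
e4 = lambda m: m["x"] + 1      # S(exp4): a different subtree's meaning

full    = lambda m: f(e1(m), e2(m))              # S(foo(exp1, exp2))
context = lambda e: (lambda m: f(e(m), e2(m)))   # S(foo(??, exp2))

m = {"x": 3, "y": 5}
print(full(m) == context(e1)(m))   # True: the context applied to S(exp1)
print(full(m) == context(e4)(m))   # False: a different subtree, a different meaning
```

Note that `context` never inspects the syntax of the subtree it receives, only its meaning, which is exactly the independence-from-context property argued above.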
Question Source : http://cs.stackexchange.com/questions/42744