## [EN] Genetic Programming

So far, we used individuals with a relatively simple encoding - in most cases the individual was an array of some values. The most complex individual was used for the classification rules - the arrays could have different length for different individuals. There are, however, evolutionary algorithms that use even more complex individuals. The typical example is tree-based genetic programming.

## Individual and Fitness

In genetic programming, the individuals encode expression trees that can be interpreted as simple programs, or that can encode a more complex structure (like an electronic circuit). In the most simple case, the tree is just a single expression (for example an arithmetic one). In this case, genetic programming can be used to solve the problem of symbolic regression. We are given a set of inputs and outputs and the goal is to find an expression that, when applied to the inputs, produces the correct outputs. Typically, the quality of the expression is measured by the mean square error of the outputs for given inputs (the mean of the squares of the differences between the desired output and the actual output).

## Terminals and Non-Terminals

The individuals (trees) are composed of so called terminals and non-terminals. Terminals are the leafs of the tree and contain inputs or constants in the expression. Non-terminals, on the other hand, are the internal nodes and contain functions. For symbolic regression, we will typically have the common mathematical operations like addition, subtraction, multiplication, etc., and, in some cases other interesting function (sine, cosine, exponential, ...). We can however also have other functions, like if conditions (or "if less", "if less than 0" etc.). The specific functions used as non-terminals is problem specific. If we do not use all the needed functions, GP may not be able to find a reasonable solution, on the other hand, using more functions than needed makes the optimization harder.

## Initialization and Genetic Operators

As we already saw, with more complex individuals come more complex methods for initialization and more complex (and varied) genetic operators. In GP the random initial individuals are random trees. There are two basic methods to generate them - either a random tree with a given number of nodes is generated (sometimes called "grow"), or random tree with a given depth is generated ("full"). The most common method however is a combination of these two - half of the population is generated by "grow" and half is generated by "full". This method is called "ramped half-and-half".

In genetic programming, typically we use relatively large number of genetic operators. The most common crossover operator exchanges random subtrees between two individuals. There are a few variants of this operator- one of them for example ensures that the exchanged subtrees have similar size or depth, this ensures that the resulting trees will not be too large. There is quite a large number of different mutation operators commonly used in GP. The more common ones change a node in the tree for another node (we have to be careful and change non-terminals only non-terminals with the same number of arguments). Another option is to remove a subtree and generate a new random one instead. One of the goals of the mutation may also be to make the expression smaller - in such a case a subtree can be replaced by a smaller subtree or even a terminal node.

## Deap Implementation

The deap library has a number of useful methods for genetic programming, all the above-mentioned techniques for initialization and all the mentioned genetic operators are implemented in deap. In order to implement GP in deap, we have to specify the primitive set (terminals and non-terminals). Here we can specify any functions as non-terminals, even custom ones. However, we have to be careful, the functions must be defined for any inputs. For example, if we implement division, we need to somehow ensure what should happen if the divisor is 0. The rest of the implementation is similar to those I showed you last time, all the utilities for genetic programming are in the `deap.gp`

module. You can see an example of GP in deap implemented in the `deap_gp.py`

file in the repository.

The last few lines in the file show, how to extract the values from the deap log structure and save the logs in the same format we used so far. I also made a slight fix in the utils file, so that it now works even if the number of fitness evaluations is different in each generation (deap evaluates only changed individuals).