# Symbolic Regression

## Symbolic Regression

DataDrivenDiffEq.jl provides an interface to SymbolicRegression.jl to `solve`

a `DataDrivenProblem`

:

```
using DataDrivenDiffEq, LinearAlgebra, Random
using SymbolicRegression
Random.seed!(1223)
# Generate a multivariate function for SymbolicRegression
X = rand(2,20)
f(x) = [sin(x[1]); exp(x[2])]
Y = hcat(map(f, eachcol(X))...)
# Define the options
opts = EQSearch([+, *, sin, exp], maxdepth = 1, progress = false, verbosity = 0)
# Define the problem
prob = DirectDataDrivenProblem(X, Y)
# Solve the problem
res = solve(prob, opts, numprocs = 0, multithreading = false)
sys = result(res)
```

```
i = 1
i = 2
Model ##Basis#512 with 2 equations
States : x[1] x[2]
Independent variable: t
Equations
φ₁ = sin(x[1])
φ₂ = exp(x[2])
```

`Solve`

can be used with `EQSearch`

, which wraps `Options`

provided by SymbolicRegression.jl. Additional keyword arguments are `max_iter = 10`

, which defines the number of iterations, `weights`

which weight the measurements of the dependent variable (e.g. `X`

, `DX`

or `Y`

depending on the `DataDrivenProblem`

), `numprocs`

which indicates the number of processes to use, `procs`

for use with manually setup processes, `multithreading = false`

for multithreading and `runtests = true`

which performs initial testing on the environment to check for possible errors. This setup mimics the behaviour of `EquationSearch`

.

## OccamNet

As introduced in Interpretable Neuroevolutionary Models for Learning Non-Differentiable Functions and Programs , `OccamNet`

is a special form of symbolic regression which uses a probabilistic approach to equation discovery by using a feedforward multilayer neural network. In contrast to normal architectures, each layer's weights reflect the probability of which inputs to use. Additionally a set of activation functions is used, instead of a single function. Similar to simulated annealing, a temperature is included to control the exploration of possible functions.

`DataDrivenDiffEq`

offers two main interfaces to `OccamNet`

: a `Flux`

based API with `Flux.train!`

and a `solve(...)`

function.

Consider the following example, where we want to discover a vector valued function:

```
using DataDrivenDiffEq, LinearAlgebra, ModelingToolkit, Random
using Flux
Random.seed!(1223)
# Generate a multivariate dataset
X = rand(2,10)
f(x) = [sin(π*x[2]+x[1]); exp(x[2])]
Y = hcat(map(f, eachcol(X))...)
```

```
2×10 Matrix{Float64}:
0.987452 0.183491 0.413858 0.536896 … 0.143889 0.943865 0.728553
1.3638 2.1489 1.07224 1.95 2.51864 1.31227 1.13993
```

Next, we define our network:

`net = OccamNet(2, 2, 3, Function[sin, +, *, exp], skip = true, constants = Float64[π])`

`OccamNet(4, Constants 1, Parameters 0)`

Where `2,2,3`

refers to input and output dimension and the number of layers *without the output layer*. We also define that each layer uses the functions `sin, +, *, exp`

as activations and uses a `π`

as a constant, which gets concatenated to the input data. Additionally, `skip`

indicates the usage of skip connections, which allow the output of each layer to be passed onto the output layer directly.

To train the network over `100`

epochs using `ADAM`

, we type

`Flux.train!(net, X, Y, ADAM(1e-2), 100, routes = 100, nbest = 3)`

Under the hood, we select possible routes, `routes`

, through the network based on the probability reflected by the `ProbabilityLayer`

forming the network. From these we use the `nbest`

candidate routes to train the parameters of the network, which increases the probability of those routes.

Lets have a look at some possible equations after the initial training. We can use `rand`

to sample a route through the network, compute the output probability with `probability`

and transform it into analytical equations using `ModelingToolkit`

variables as input. The call `net(x, route)`

uses the route to compute just the elements on this path.

```
@variables x[1:2]
for i in 1:10
route = rand(net)
prob = probability(net, route)
eq = simplify.(net(x, route))
print(eq , " with probability ", prob, "\n")
end
```

```
Symbolics.Num[sin(sin(x[1])), exp(x[1])] with probability [0.01102929262864759, 0.010139807174587864]
Symbolics.Num[sin(x[1]), 6.283185307179586x[1]] with probability [0.022306805446838676, 0.00023863738120204818]
Real[1.2246467991473532e-16x[2], 3.141592653589793] with probability [3.9638065778767276e-5, 0.03692562362849779]
Symbolics.Num[sin(x[2]), 19.739208802178716] with probability [0.011799166359636024, 7.419043758394218e-5]
Symbolics.Num[x[2], exp(x[2])] with probability [0.03576425085484868, 0.05502637077718041]
Symbolics.Num[x[1], exp((3.141592653589793 + x[2])*x[1])] with probability [0.1390308364783759, 1.243774256141111e-5]
Symbolics.Num[exp(sin(x[2])), exp(sin(x[2]))] with probability [0.0007419304380345507, 0.0037001826251743314]
Symbolics.Num[3.141592653589793sin(x[1]), exp(x[2])] with probability [0.0002970862837608978, 0.13370311014829103]
Symbolics.Num[sin(3.141592653589793 + sin(x[2])), 1.1216958622467619e10] with probability [3.244367439425053e-5, 0.0006415609290187295]
Symbolics.Num[2x[1], 2x[1]] with probability [0.016105816904692993, 0.008367618701870319]
```

We see the networks proposals are not very certain. Hence, we will train for some more epochs and look at the output again.

```
Flux.train!(net, X, Y, ADAM(1e-2), 900, routes = 100, nbest = 3)
for i in 1:10
route = rand(net)
prob = probability(net, route)
eq = simplify.(net(x, route))
print(eq , " with probability ", prob, "\n")
end
```

```
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
Symbolics.Num[sin(3.141592653589793x[2] + exp(x[2])), exp(x[2])] with probability [0.004030565957195384, 0.8850706726604041]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
Symbolics.Num[sin(sin(x[1]) + x[1]), exp(x[2])] with probability [0.0039708226514672575, 0.08116397633914071]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
Symbolics.Num[sin(x[1]), exp(x[2])] with probability [0.0020833758379985145, 0.8850706726604041]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.9101007100928445, 0.8850706726604041]
```

The network is quite certain about the equation now, which is in fact our unknown mapping. To extract the solution with the highest probability, we set the temperature of the underlying distribution to a very low value. In the limit of `t ↦ 0`

we approach a Dirac distribution, hence extracting the most likely terms.

```
set_temp!(net, 0.01)
route = rand(net)
prob = probability(net, route)
eq = simplify.(net(x, route))
print(eq , " with probability ", prob, "\n")
```

`Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [1.0, 1.0]`

The same procedure is automated in the `solve`

function. Using the same data, we wrap the algorithm's information in the `OccamSR`

struct and define a `DataDrivenProblem`

:

```
# Define the problem
ddprob = DirectDataDrivenProblem(X, Y)
# Define the algorithm
sr_alg = OccamSR(functions = Function[sin, +, *, exp], skip = true, layers = 3, constants = [π])
# Solve the problem
res = solve(ddprob, sr_alg, ADAM(1e-2), max_iter = 1000, routes = 100, nbest = 3)
```

```
Nonlinear Solution with 2 equations and 0 parameters.
Returncode: 0.8231749776476097
L₂ Norm error : [0.0, 0.0]
AIC : [Inf, Inf]
```

Within `solve`

, a network is generated using the information provided by the `DataDrivenProblem`

(states, control, independent variables, and the specified options). Then the network is trained, and finally the equation with the highest probability is extracted by setting the temperature as above. After computing additional metrics, a `DataDrivenSolution`

is returned where the equations are transformed into a `Basis`

usable with `ModelingToolkit`

.

The metrics can be accessed via:

`metrics(res)`

```
Dict{Any, Any} with 2 entries:
:AIC => [Inf, Inf]
:L₂ => [0.0, 0.0]
```

and the resulting `Basis`

by:

`result(res)`

```
Model ##Basis#515 with 2 equations
States : x[1] x[2]
Independent variable: t
Equations
φ₁ = sin(3.141592653589793x[2] + x[1])
φ₂ = exp(x[2])
```

Right now, the resulting basis is not using parameters, but raw numerical values.