Symbolic Regression

DataDrivenDiffEq.jl provides an interface to SymbolicRegression.jl to solve a DataDrivenProblem:

using DataDrivenDiffEq, LinearAlgebra, Random
using SymbolicRegression

Random.seed!(1223)
# Generate a multivariate function for SymbolicRegression
X = rand(2,20)
f(x) = [sin(x[1]); exp(x[2])]
Y = hcat(map(f, eachcol(X))...)

# Define the options
opts = EQSearch([+, *, sin, exp], maxdepth = 1, progress = false, verbosity = 0)

# Define the problem
prob = DirectDataDrivenProblem(X, Y)

# Solve the problem
res = solve(prob, opts, numprocs = 0, multithreading = false)
sys = result(res)
Model ##Basis#80520 with 2 equations
States : x[1] x[2]
Independent variable: t
Equations
φ₁ = sin(x[1])
φ₂ = exp(x[2])

solve can be used with EQSearch, which wraps the Options provided by SymbolicRegression.jl. Additional keyword arguments are: max_iter = 10, which defines the number of iterations; weights, which weight the measurements of the dependent variable (X, DX, or Y, depending on the DataDrivenProblem); numprocs, which sets the number of worker processes to use; procs, for manually set up processes; multithreading = false, which toggles multithreading; and runtests = true, which performs initial tests on the environment to check for possible errors. This setup mimics the behaviour of EquationSearch.
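For example, a call exercising these keyword arguments could look like the following. The values are purely illustrative, and we assume weights holds one weight per measurement of Y:

res = solve(prob, opts,
            max_iter = 20,              # number of iterations
            weights = ones(size(Y)),    # weight each measurement of Y equally
            numprocs = 0,               # no extra worker processes
            multithreading = false,     # single-threaded search
            runtests = true)            # check the environment before searching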

OccamNet

As introduced in Interpretable Neuroevolutionary Models for Learning Non-Differentiable Functions and Programs, OccamNet is a special form of symbolic regression which takes a probabilistic approach to equation discovery via a feedforward multilayer neural network. In contrast to standard architectures, each layer's weights encode the probability of selecting each of its inputs. Additionally, a set of activation functions is used instead of a single function. Similar to simulated annealing, a temperature is included to control the exploration of possible functions.
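To make this concrete, the following self-contained sketch (a simplification restricted to unary activations, not the library's implementation) shows how such a layer can select its inputs through a temperature-scaled softmax:

softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))

struct ToyProbLayer
    W::Matrix{Float64}     # unnormalized scores: one row per activation
    fs::Vector{Function}   # unary activation functions, one per row
    t::Float64             # temperature: high = explore, low = exploit
end

# Sample one input per activation according to softmax(W / t) and apply it.
function (l::ToyProbLayer)(x::AbstractVector)
    out = similar(x, size(l.W, 1))
    for i in axes(l.W, 1)
        p = softmax(l.W[i, :] ./ l.t)
        out[i] = l.fs[i](x[findfirst(cumsum(p) .>= rand())])
    end
    return out
end

layer = ToyProbLayer(randn(3, 2), Function[sin, exp, identity], 1.0)
layer([0.5, 1.0])   # e.g. [sin(x[2]), exp(x[1]), x[2]], depending on the draw

Lowering the temperature t concentrates the softmax on the highest-scoring input, making the sampled routes increasingly deterministic.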

DataDrivenDiffEq offers two main interfaces to OccamNet: a Flux-based API via Flux.train! and a solve(...) function.

Consider the following example, where we want to discover a vector-valued function:

using DataDrivenDiffEq, LinearAlgebra, ModelingToolkit, Random
using Flux

Random.seed!(1223)

# Generate a multivariate dataset
X = rand(2,10)
f(x) = [sin(π*x[2]+x[1]); exp(x[2])]
Y = hcat(map(f, eachcol(X))...)
2×10 Matrix{Float64}:
 0.730729  0.999265  0.969019  0.625678  …  0.669184  0.95687  0.92499
 1.04471   1.29513   1.69627   1.74007      1.5878    1.36212  1.20932

Next, we define our network:

net = OccamNet(2, 2, 3, Function[sin, +, *, exp], skip = true, constants = Float64[π])
OccamNet(4, Constants 1, Parameters 0)

Here, 2, 2, 3 refer to the input dimension, the output dimension, and the number of layers excluding the output layer. Each layer uses the functions sin, +, *, and exp as activations, and π is included as a constant, which gets concatenated to the input data. Additionally, skip = true enables skip connections, which pass the output of each layer directly to the output layer.
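Assuming skip and constants are optional keywords that simply default to "off" (an assumption based on the signature above, not verified against the API), a plain network without skip connections or constants would be:

net_plain = OccamNet(2, 2, 3, Function[sin, +, *, exp])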

To train the network for 100 epochs using ADAM, we write:

Flux.train!(net, X, Y, ADAM(1e-2), 100, routes = 100, nbest = 3)

Under the hood, routes candidate paths (routes) through the network are sampled according to the probabilities encoded in the ProbabilityLayers forming the network. From these, the nbest candidate routes are used to train the parameters of the network, which increases the probability of those routes.
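The following self-contained toy reduces this idea to a single categorical choice, which of three candidate functions explains the data, to illustrate the sample-and-reinforce scheme behind routes and nbest:

fs = Function[sin, exp, x -> x^2]                # candidate functions
w  = zeros(3)                                    # unnormalized selection scores
softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))

xdata = rand(50); ydata = exp.(xdata)            # ground truth: exp
for epoch in 1:200
    p = softmax(w)
    routes = [findfirst(cumsum(p) .>= rand()) for _ in 1:100]  # sample 100 routes
    losses = [sum(abs2, fs[r].(xdata) .- ydata) for r in routes]
    for r in routes[partialsortperm(losses, 1:3)]              # keep the 3 best
        w[r] += 0.1 * (1 - p[r])     # make the winning choice more probable
    end
end
softmax(w)   # probability mass concentrates on the second entry, i.e. exp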

Let's have a look at some possible equations after the initial training. We can use rand to sample a route through the network, compute its probability with probability, and transform it into analytical equations by using ModelingToolkit variables as input. The call net(x, route) uses the route to compute just the elements on this path.

@variables x[1:2]

for i in 1:10
  route = rand(net)
  prob = probability(net, route)
  eq = simplify.(net(x, route))
  print(eq, " with probability ", prob, "\n")
end
Symbolics.Num[sin(exp(x[2])), exp(x[2]) + x[2]] with probability [0.02874208875253813, 0.001020356452889173]
Symbolics.Num[sin((3.141592653589793 + x[1])*x[1]), exp(x[2])] with probability [2.4268035492071496e-5, 0.10004718259206825]
Symbolics.Num[exp(x[2]), exp(x[2])] with probability [0.010309792918739237, 0.011230891123571646]
Symbolics.Num[x[1], 3.141592653589793 + x[1]] with probability [0.05388524671461696, 0.0011426040901520863]
Symbolics.Num[-0.9125775986692777, exp(x[2])] with probability [0.006522697942050591, 0.02707702334938692]
Symbolics.Num[19333.689074365135, exp(x[2])] with probability [0.0006005975724607248, 0.10004718259206825]
Symbolics.Num[sin(exp(x[2])), x[1]] with probability [0.047892106785117454, 0.04565968696366679]
Symbolics.Num[sin(exp(x[2])), exp(x[2])] with probability [0.047892106785117454, 0.10004718259206825]
Symbolics.Num[1.2246467991473532e-16, exp(x[2])] with probability [0.017199511543924042, 0.02707702334938692]
Symbolics.Num[exp(x[2]^2), x[2]^2] with probability [0.0015392645555503143, 0.00936874746648553]

We see that the network's proposals are not very certain yet. Hence, we will train for some more epochs and look at the output again.

Flux.train!(net, X, Y, ADAM(1e-2), 900, routes = 100, nbest = 3)

for i in 1:10
  route = rand(net)
  prob = probability(net, route)
  eq = simplify.(net(x, route))
  print(eq, " with probability ", prob, "\n")
end
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(exp(x[2])), exp(x[2])] with probability [0.002425719500856979, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [0.8965549834356896, 0.956500412755264]

The network is now quite certain about the equation, which is in fact our unknown mapping. To extract the solution with the highest probability, we set the temperature of the underlying distribution to a very low value. In the limit t → 0 the distribution approaches a Dirac distribution, hence extracting the most likely terms.
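A quick numerical illustration of this effect on a plain softmax over arbitrary scores:

softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))
scores = [1.0, 2.0, 3.0]
softmax(scores ./ 1.0)    # ≈ [0.09, 0.245, 0.665], still exploratory
softmax(scores ./ 0.01)   # ≈ [0.0, 0.0, 1.0], effectively a Dirac distribution

Applying this to the network: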

set_temp!(net, 0.01)
route = rand(net)
prob = probability(net, route)
eq = simplify.(net(x, route))
print(eq, " with probability ", prob, "\n")
Symbolics.Num[sin(3.141592653589793x[2] + x[1]), exp(x[2])] with probability [1.0, 1.0]

The same procedure is automated in the solve function. Using the same data, we wrap the algorithm's information in the OccamSR struct and define a DataDrivenProblem:

# Define the problem
ddprob = DirectDataDrivenProblem(X, Y)
# Define the algorithm
sr_alg = OccamSR(functions = Function[sin, +, *, exp], skip = true, layers = 3, constants = [π])
# Solve the problem
res = solve(ddprob, sr_alg, ADAM(1e-2), max_iter = 1000, routes = 100, nbest = 3)
Explicit Result
Solution with 2 equations and 0 parameters.
Returncode: success
L2 Norm Error: 0.0
AICC: Inf

Within solve, a network is generated using the information provided by the DataDrivenProblem (states, controls, independent variables) and the specified options. The network is then trained, and the equation with the highest probability is extracted by setting the temperature as above. After computing additional metrics, a DataDrivenSolution is returned, where the equations are transformed into a Basis usable with ModelingToolkit.
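Schematically, and only as a rough sketch assembled from the calls demonstrated manually above (not the actual source), solve does something like:

net = OccamNet(size(X, 1), size(Y, 1), 3, Function[sin, +, *, exp],
               skip = true, constants = Float64[π])    # build net from problem and options
Flux.train!(net, X, Y, ADAM(1e-2), 1000, routes = 100, nbest = 3)  # train
set_temp!(net, 0.01)      # collapse the route distribution
route = rand(net)         # now yields the most probable route
eqs = net(x, route)       # symbolic equations via ModelingToolkit variables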

The metrics can be accessed via:

metrics(res)
(Probability = 0.8425497124024455, Error = 0.0, AICC = Inf, Probabilities = [0.898724142309076, 0.9374953589626484], Errors = [0.0, 0.0], AICCs = [Inf, Inf])

and the resulting Basis by:

result(res)
Model ##Basis#80523 with 2 equations
States : x[1] x[2]
Independent variable: t
Equations
Differential(t)(x[1]) = sin(3.141592653589793x[2] + x[1])
Differential(t)(x[2]) = exp(x[2])
Info: Right now, the resulting basis does not use parameters, but raw numerical values.