‘Machine Scientists’ Distill the Laws of Physics from Raw Data
I am hoping this also opens up more opportunities to leverage Lisp's symbolic powers. I had great fun with Structure and Interpretation of Classical Mechanics (SICM), and recently read a paper on analyzing music using symbolic regression [0] and another on symbolic regression with Lisp [1]. Julia and CL seem perfect for this, with Mathematica as another option for quickly playing with ideas.
[0] https://www.researchgate.net/publication/286905402_Symbolic_...
[1] https://towardsdatascience.com/symbolic-regression-the-forgo...
I'm really curious to see whether efforts like this, and computational augmentation and automation more generally, will yield progress on symbolic modeling of large natural systems.
There is a segment in Adam Curtis's "All Watched Over By Machines of Loving Grace" which describes scientists trying to model the ecology of a prairie. As described by Curtis, the more data they collected and the more complex their model became, the worse its predictions got. It feels like the lessons of the past couple of generations (many decades) have been that symbolic models don't compose together easily; that symbolic analysis only works in simple systems and has hit diminishing returns; that numeric and ML methods work well enough; etc. I'm curious whether better tooling (augmentation and/or automation) can push through some of these challenges and yield real understanding, or whether there are fundamental scaling problems that cannot be overcome.
I found it curious that one of the implementations of symbolic regression (the "machine scientist" referenced in the article) is a Python wrapper on Julia: https://github.com/MilesCranmer/PySR
I don't think I've seen a Python wrapper on Julia code before.
I worked on genetic programming for my MSc in econometrics, using it to build human-readable risk models for time-series data. It worked really well. The article seems to call these methods genetic algorithms, though Koza himself (the "godfather" of genetic programming) was pretty keen on distinguishing the two. The main difference is that genetic algorithms represent solutions as linear strings / vectors, whereas genetic programming works on tree structures. Maybe just splitting hairs here though.
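That representational difference can be sketched in a few lines of Python (the node layout and names here are illustrative, not Koza's actual encoding):

```python
import operator

# A genetic algorithm typically evolves a flat vector, e.g. coefficients:
ga_individual = [0.5, -1.2, 3.0]  # candidate for y = a*x^2 + b*x + c

# Genetic programming evolves expression trees instead: each node is
# (operator, child, child) or a leaf like ('var',) or ('const', value).
gp_individual = ("add", ("mul", ("var",), ("var",)), ("const", 3.0))  # x*x + 3

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(node, x):
    """Recursively evaluate an expression tree at input x."""
    tag = node[0]
    if tag == "var":
        return x
    if tag == "const":
        return node[1]
    return OPS[tag](evaluate(node[1], x), evaluate(node[2], x))

print(evaluate(gp_individual, 2.0))  # x*x + 3 at x=2 -> 7.0
```

Crossover and mutation then operate on subtrees rather than on vector slices, which is what makes the evolved models directly readable as formulas.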
Also have a look at SINDy (Sparse Identification of Nonlinear Dynamics) by Steve Brunton et al.
Basically, this is approximating a solution and then finding an equation from some data. Nothing wrong with writing down an approximation to some problem; an approximation can be very useful. As long as you know its limitations, it’s fine to answer questions where it applies. But this cannot replace deeper understanding. It’s all well and good to write down approximate solutions to some hard n-body problem or nonlinear PDE; just don’t look too far ahead in time, and update your model with fresh data when you need to.
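For the curious, SINDy's core trick is sparse regression over a library of candidate terms. A minimal NumPy sketch of the sequential thresholded least-squares loop, assuming clean toy data and a library that happens to contain the right term:

```python
import numpy as np

# Toy data from dx/dt = -2x, i.e. x(t) = exp(-2t). In practice the
# derivative would be estimated numerically from noisy measurements.
t = np.linspace(0, 2, 200)
x = np.exp(-2 * t)
dxdt = -2 * x  # exact derivative, for the demo only

# Candidate library of terms: [1, x, x^2, x^3]
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Sequentially thresholded least squares (the core SINDy idea):
xi = np.linalg.lstsq(Theta, dxdt, rcond=None)[0]
for _ in range(10):
    small = np.abs(xi) < 0.1          # prune near-zero coefficients
    xi[small] = 0.0
    big = ~small
    xi[big] = np.linalg.lstsq(Theta[:, big], dxdt, rcond=None)[0]

print(xi)  # ~ [0, -2, 0, 0]: recovers dx/dt = -2x
```

The sparsity threshold is doing the "find a readable equation" work: without it you get a dense fit across all library terms, with it you get back the one-term law.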
This article seems to be about ML but good old fashioned brute force searches are also possible.
It's trivial to find formulas that fit data to within the variables' uncertainties; you can easily drown in such output. There are two main challenges with this approach: 1. making sure you have the right "ingredients" (factors) for the search, and 2. filtering the output so that human reviewers don't waste time on obviously nonsensical equations.
Filtering the output is not terribly difficult. Some assessment of complexity is required, and depending on the formula a symmetry score might be useful. Note that some actual physics equations are quite complex, so unless you expect a simple equation, filtering the output will be less effective.
Making sure you have all of the right potential factors available is very hard. Expanding the input factor set extends the search space and time.
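A toy version of that pipeline, with made-up data and a deliberately tiny ingredient set: enumerate small expression trees, keep those that fit every point within the stated uncertainty, then rank the survivors simplest-first:

```python
import itertools

# Hypothetical measurements of y = x**2 + 1, with uncertainty sigma.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 10.0, 17.0]
sigma = 0.01

# The "ingredients": leaves and operators the search may combine.
LEAVES = [("x",), ("1",)]
OPS = ["+", "*"]

def candidates(depth):
    """Enumerate expression trees up to the given depth (with duplicates)."""
    if depth == 0:
        yield from LEAVES
        return
    yield from candidates(depth - 1)
    for op, a, b in itertools.product(OPS, candidates(depth - 1), candidates(depth - 1)):
        yield (op, a, b)

def ev(node, x):
    if node[0] == "x":
        return x
    if node[0] == "1":
        return 1.0
    a, b = ev(node[1], x), ev(node[2], x)
    return a + b if node[0] == "+" else a * b

def size(node):  # complexity score = node count, used to rank survivors
    return 1 if len(node) == 1 else 1 + size(node[1]) + size(node[2])

# Keep formulas that fit every point within 3 sigma, simplest first,
# so human reviewers only see the plausible candidates.
hits = [c for c in candidates(2)
        if all(abs(ev(c, x) - y) <= 3 * sigma for x, y in zip(xs, ys))]
hits.sort(key=size)
print(hits[0])  # simplest surviving formula: 1 + x*x in tree form
```

Even in this tiny setting the two failure modes show up: leave "x" out of LEAVES and nothing fits; add more operators and the candidate count explodes combinatorially.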
Aw, no mention of the two Robot Scientists, Adam and Eve, who don't just invent theories but also design and run their own experiments to verify their theories?
Wikipedia page:
https://en.wikipedia.org/wiki/Robot_Scientist
Nature publication:
https://www.nature.com/articles/nature02236
I guess US media don't realise there's science and technology outside of the Americas eh?
For those interested, this is the relevant article (I think): https://www.science.org/doi/10.1126/sciadv.aav6971 - I'm partway through it now and am having a blast.
“These algorithms resemble supercharged versions of Excel’s curve-fitting function, except they look not just for lines or parabolas to fit a set of data points, but billions of formulas of all sorts. In this way, the machine scientist could give the humans insight into why cells divide, whereas a neural network could only predict when they do.”
It’s very cool witnessing the consequences of Moore’s law; as the cost of billions of calculations goes to pennies, this becomes easier and easier.
Since the complexity of nature is unbounded, I wonder if the slowdown of Moore’s law will represent a plateau in what can be symbolically discovered.
For example, will the next set of scientific discoveries require quintillions of combinatorial checks that can’t be accomplished in a human lifetime?
> A constant implied that it had identified two proportional quantities — in this case, period squared and distance cubed. In other words, it stopped when it found an equation.
Very interesting article. Thanks for posting.
I've been thinking about the physics equation and its relation to proportionality. Not all physics equations are proportionalities, because in physics the equality sign is loaded.
To me, proportionality is fundamental, not the physics equation. So I would have written the last sentence of the quote as "...it stopped when it found [a proportionality]."
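For the Kepler example in the quoted passage, the proportionality view is easy to check numerically: with periods in years and distances in AU, T²/a³ comes out as the same constant for every planet (standard approximate orbital values):

```python
# Orbital period T (years) and semi-major axis a (AU), approximate values.
planets = {
    "Earth":   (1.000, 1.000),
    "Mars":    (1.881, 1.524),
    "Jupiter": (11.86, 5.203),
}

# Kepler's third law as a proportionality: T^2 is proportional to a^3,
# i.e. T^2 / a^3 is the same constant for every planet.
for name, (T, a) in planets.items():
    print(f"{name:8s} T^2/a^3 = {T**2 / a**3:.3f}")  # all ratios come out near 1
```

Detecting that the ratio of two candidate quantities is constant across the data set is exactly the stopping condition the article describes.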
Brilliant! I am inspired to build my own machine scientists.
"They started by downloading all the equations in Wikipedia. They then statistically analyzed those equations to see what types are most common. This allowed them to ensure that the algorithm’s initial guesses would be straightforward — making it more likely to try out a plus sign than a hyperbolic cosine, for instance. The algorithm then generated variations of the equations using a random sampling method that is mathematically proven to explore every nook and cranny in the mathematical landscape."
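Just the prior part of that can be sketched simply. The operator frequencies below are made up for illustration (standing in for the Wikipedia counts the authors collected), and the quoted random-sampling scheme is far more sophisticated than this naive generator:

```python
import random

# Hypothetical operator frequencies: '+' far more likely than 'cosh',
# mimicking the empirical distribution of operators in known equations.
PRIOR = {"+": 0.45, "*": 0.35, "/": 0.15, "cosh": 0.05}

def sample_expr(depth, rng):
    """Randomly grow an expression string, preferring common operators."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", "y", "2"])  # terminate with a leaf
    op = rng.choices(list(PRIOR), weights=list(PRIOR.values()))[0]
    if op == "cosh":
        return f"cosh({sample_expr(depth - 1, rng)})"
    return f"({sample_expr(depth - 1, rng)} {op} {sample_expr(depth - 1, rng)})"

rng = random.Random(0)
for _ in range(3):
    print(sample_expr(3, rng))
```

The point of biasing the guesses this way is purely efficiency: the sampler still reaches every expression eventually, it just spends most of its time on the kinds of formulas that have historically turned out to describe nature.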
Wikipedia contains equations from many different branches of math.
Aren't mathematical equations written in all sorts of notations (really obscure notations sometimes, for obscure branches of math), depending on the branch of math they're used in?
How would this program even be able to use equations written in a notation it doesn't understand? Even if it somehow understood the notation, that doesn't mean the algorithm understood the branch of math the equation was for.
This all sounds unworkable to me, unless they're somehow limiting the equations they use to some branch of math they already understand.
This is also how a few hedge funds (WorldQuant and its spinoffs) get alpha signals/factors for their long/short portfolios. They get 95% of their aggregate alpha/returns from genetic programming, not from human quant researchers.
Wouldn't this be a "what we already know about physics" distillation? I'm basing this on the "raw data" being data from experiments that have already been performed.
Do they apply Occam's razor, where simpler equations are favored over more complicated ones?
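Complexity penalties are the usual way to build that preference in. Here is a toy description-length-style score with made-up numbers; it is not the article's actual Bayesian criterion, just an illustration of how a slightly worse fit can beat a more complex formula:

```python
import math

# Two hypothetical candidate formulas fitting the same data set:
# (sum of squared residuals, number of nodes in the formula)
candidates = {
    "x**2 + 1":                 (0.010, 5),
    "x**2 + 1 + 0.001*cosh(x)": (0.009, 11),
}
n = 100  # number of data points

def score(sse, k):
    """Toy description-length score: fit term plus complexity penalty."""
    return n * math.log(sse / n) + k * math.log(n)

best = min(candidates, key=lambda c: score(*candidates[c]))
print(best)  # the simpler formula wins despite its slightly worse fit
```

The extra cosh term buys a tiny reduction in residuals, but its complexity penalty outweighs that gain, so the search settles on the simpler law.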
What could go wrong?