Scientific theories & machine learning



Peter Norvig, director of research at Google, wrote a few interesting articles on machine learning in natural language processing. In one of them, he makes the case for using machine learning instead of handmade theories in complex domains:

But to be clear: the methodology still involves models. Theory has not ended, it is expanding into new forms. Sure, we all love succinct theories like \(F = ma\). But social science domains and even biology appear to be inherently more complex than physics. Let's stop expecting to find a simple theory, and instead embrace complexity, and use as much data as well as we can to help define (or estimate) the complex models we need for these complex domains.

In the same vein, just a tad more radical, Chris Anderson wrote The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.

Machine learning is a puzzle for scientists. Ever since the early days of science, we've relied primarily on predictions to select the best theories. Quantum physics didn't become popular because of its fancy name, and certainly not because it was intuitive. It gained momentum because it worked: it made correct predictions where other theories faltered. Machine learning is different. It is a set of algorithms designed to build models: theories, in a sense, for building theories automatically from data. The fundamental difference between quantum physics and the theories generated with statistical machine learning is that you can peek inside the model of the former to get a clear understanding of what makes quantum systems tick, while the latter is often an opaque black box. There are also important differences in how theories are evaluated and selected (side note: I'll use theory and model interchangeably).

There aren't enough discussions about machine learning and how it relates to the traditional role of theoretical scientists. Machine learning has been surprisingly effective at cracking complex problems in a wide array of domains. It's almost impossible to use a smartphone or browse the internet for a few minutes without being fed the output of several machine learning algorithms. Meanwhile, fields like theoretical population genetics, theoretical ecology, natural language processing, and many others struggle with complexity. I enthusiastically support Norvig's call to embrace complexity, but at the same time we lose something fundamental to science by relying on statistical machine learning approaches. So what is the place of machine learning? I doubt there is a simple answer to the question, but machine learning represents an important paradigm shift: we've never relied so much on models that were not directly built by humans. We won't be able to avoid serious discussions about how machine learning relates to the traditional role of theory in science.

So let's talk a bit about machine learning and scientific theories.

The queen of science

Traditional theory-making is symbolic. We find equations in which symbols represent objects and functions establish relations between those objects. No need to look very far: Norvig quotes Newton's second law of motion \(F = ma\), which establishes that force equals mass times acceleration.

The beauty of such theories is that they are, implicitly, all part of that big corpus of knowledge we call science. If you have an equation somewhere with acceleration, you can replace it with \(F/m\), and the fact that you can transform \(F = ma\) into \(a = F/m\) is itself part of our understanding of mathematics. It seems silly and obvious, but it's really the strength of these theories: they are all part of a single interconnected base of knowledge about the universe. Henri Poincaré wrote:

The Scientist must set in order. Science is built up of facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house.

Theory is the house: it's what allows us to structure knowledge. A statistical model found by a machine learning algorithm, even a probabilistic graphical model, doesn't do much to build the house. As effective as it may be at solving a specific problem, it fails as an integrative force. The issue is probably best explained by a table from Russell & Norvig (him again!?!):

Language               Ontological commitment       Epistemological commitment
Probability Theory     Facts                        Degree of belief \(\in [0, 1]\)
Logic (First-Order)    Facts, objects, relations    True / False / Unknown

The key point is that statistical models are inherently limited in their ability to build knowledge bases, since their framework (probability theory) recognizes only facts, not objects or relations between objects.
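To make the contrast in the table a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the people, the `friends` relation, and the rule "friends of smokers smoke" are hypothetical and do not come from Russell & Norvig. The first half stores facts as opaque labels with a degree of belief. The second half represents objects and relations, so a single rule can be applied to every object.

```python
# Propositional / probabilistic view: each fact is an opaque label
# with a degree of belief in [0, 1]. Nothing tells us that the two
# facts below are about the same kind of thing.
beliefs = {
    "anna_smokes": 0.9,
    "bob_smokes": 0.2,
}

# First-order view: objects, relations, and a rule quantified over objects.
smokes = {"anna"}
friends = {("anna", "bob"), ("bob", "carl")}


def infer_smokers(smokes, friends):
    """Forward-chain the (hypothetical) rule
    friends(x, y) and smokes(x) => smokes(y)
    until no new fact can be derived."""
    derived = set(smokes)
    changed = True
    while changed:
        changed = False
        for x, y in friends:
            if x in derived and y not in derived:
                derived.add(y)
                changed = True
    return derived


smokers = infer_smokers(smokes, friends)
```

Adding a new person to `friends` requires no change to the rule, whereas the `beliefs` dictionary needs a new, unrelated entry for every new fact.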

If we want to be fair in our comparison of machine learning and traditional theory-making, we must take into account the ability of symbolic theories to connect with each other. A statistical model that fits a dataset well and generates good predictions is definitely useful, but it does not contribute to knowledge in the same way as a symbolic model.
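A small worked example of that connectivity, using a relation that is not in the text above: the standard kinematic equation for velocity under constant acceleration, \(v = v_0 + at\). Assuming a constant force and a constant mass, Newton's second law plugs straight into it:

\[
F = ma \;\Rightarrow\; a = \frac{F}{m}, \qquad v = v_0 + at \;\Rightarrow\; v = v_0 + \frac{F}{m}\,t .
\]

Two separate pieces of knowledge combine into a third with nothing more than a substitution. A fitted statistical model offers no comparable operation.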

The great obsession

Well then, traditional theory-making is good, and hipster machine learning is bad, right? Not so fast. The enthusiasm for machine learning comes from the fact that many problems involve complex relationships between many variables and, before machine learning, this large class of problems remained mostly out of our reach. It's hard, perhaps impossible, to discover these theories by hand. Evolutionary theory offers a great example.

Things got serious for evolutionary biology in the 1910s-1920s as Wright, Fisher, and Haldane founded the field of theoretical population genetics. In a short period of time, Darwin's natural selection was clearly defined in terms of Mendelian genetics, and other mechanisms like random genetic drift were described. Fundamental equations were written, new mathematics was developed along the way, and all seemed well. A few decades later, with the rise of molecular biology, we started to have a clearer view of biodiversity in the wild. Unsurprisingly, we were (again) surprised by how diverse and complex the living world was: much more diverse than anticipated, and much more than our theories predicted. Theoretical scientists got back to work to find a theory that would explain how so much diversity could be maintained.

Gillespie, a prominent theoretician, called this the Great Obsession. We know the basic mechanisms underlying evolution, but we just can't find an equation to relate them in an effective predictive model. We even have a pretty good idea why it's so difficult: selection fluctuates in time and space, linkage and positive selection combine to draft neutral and even deleterious mutations, and the effect of mutations varies (a lot) across the genome. Knowing all this didn't help us find a solid theory. In 2008, Matthew Hahn wrote a damning paper that illustrates just how far we are from resolving the problem of biodiversity. His major qualm is that, since we have no solid predictive model for diversity, we tend to rely on neutral models (...and they just don't work). In his words:

The consequence of this is that we have tied ourselves into philosophical knots by using null models no one believes but are easily parameterized.

It's a trap! A model is widely believed to be wrong and generates poor predictions, but it is so easy to parameterize that we use it anyway. To be clear: we are not trading a bit of accuracy for easy parameterization, we are adopting models that are plain wrong (Hahn's paper is highly recommended for anyone interested in evolution or biodiversity). That's not how science is supposed to work. And it's just one example of how difficult it is for humans to build theory when several variables are involved. An unfortunate side effect is the demotion of theory to a source of toy models that supposedly bring "insights". But how can we expect good insights from theories that were never validated in the first place? Again, it's not how science (and theory) works, and this "theory-as-insights" philosophy risks confirming our biases, since we inevitably build our biases into our theories and then never check how effective those theories are. In the excellent Verification, validation, and confirmation of numerical models in the Earth sciences, Naomi Oreskes and her colleagues end by saying:

[W]e must admit that a model may confirm our biases and support incorrect intuitions.

On the other hand, machine learning algorithms have built effective models for voice recognition, image classification, several problems in natural language processing, ecology, molecular biology... They have proven over and over again their ability to build models when hundreds or thousands of variables are involved. Theoretical scientists can't do that.

A false dilemma

So where does this lead us? In complex domains, statistical machine learning gives us good predictive models, but it's difficult, if not impossible, to understand what is going on inside the model. More importantly, the theories generated don't connect with other theories the way symbolic models do. Picasso famously said that computers are useless: they only give answers. I guess a theoretician could quip that machine learning algorithms are useless: they only give specialized prediction machines. For a long time, being able to predict was our way to know we understood a system. General relativity replaced the classical theory of gravity because it made better predictions, and we could peek inside the equations to gain an understanding of how different objects interact to generate gravity. Theoretical understanding and predictive power were intertwined, but only in domains simple enough to expect a human to figure out the inner workings of the system.

In many cases, statistical machine learning is good enough. Throw data at an algorithm, get an effective predictive model, and you're done (well, it's a bit trickier in practice, but you get the idea). We don't necessarily need a theoretical understanding of how an array of pixels of different colors represents a cat. Dogs are more awesome anyway. The issue is that, as scientists, we often do need more than effective models built for a specific dataset. Unification, synthesis, integration, whatever you want to call it: being able to connect apparently distinct phenomena into a single theory is a key goal of science.

The good news is that there is a way to unify traditional theory-making with machine learning. This isn't speculation about how the field of A.I. / machine learning could evolve; it is already being done successfully. I'm currently working on a project with Michele Filannino where the initial model was built by experts who figured out the right set of formulas, connecting many NLP tasks together into a relatively small, and highly effective, model. Our plan involves using machine learning to learn weights for these formulas and then using another machine learning approach to look for new formulas or more effective variants of the existing model. The model is not a black box: it's a set of formulas. It's definitely possible to use machine learning to learn symbolic models, to improve our hand-made elegant theories, and to contribute to the synthesis of a domain. I would argue that, in complex domains where traditional theory-making struggles to find effective theories, this hybrid symbolic / probabilistic approach is probably necessary. Markov logic, relational sum type networks, and many other approaches to machine learning allow us to interact with the model in ways that are impossible with purely statistical machine learning algorithms. It's not necessarily a better approach in general, but it seems to align better with the role of theory in science.
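To give a feel for what "weights on formulas" can mean, here is a minimal Python sketch in the spirit of Markov logic. In Markov logic, the probability of a possible world is proportional to \(\exp(\sum_i w_i n_i)\), where \(n_i\) counts the true groundings of formula \(i\) in that world. The sketch is not the model from the project mentioned above: the two people, the single formula `smokes(x) => cancer(x)`, and its weight of 1.5 are all hypothetical, and inference is done by brute-force enumeration, which only works for toy domains.

```python
import itertools
import math

# Hypothetical toy domain: two people, two unary predicates.
PEOPLE = ["anna", "bob"]
ATOMS = [(pred, person) for pred in ("smokes", "cancer") for person in PEOPLE]


def n_smoking_implies_cancer(world):
    """Count the people for whom smokes(x) => cancer(x) holds in this world."""
    return sum(
        1
        for person in PEOPLE
        if (not world[("smokes", person)]) or world[("cancer", person)]
    )


# The model is a readable list of (weight, formula) pairs, not a black box.
# The weight 1.5 is an arbitrary illustrative value, not a learned one.
FORMULAS = [(1.5, n_smoking_implies_cancer)]


def all_worlds():
    """Yield every truth assignment to the ground atoms."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def unnormalized(world, formulas):
    """exp(sum_i w_i * n_i(world)), the unnormalized probability of a world."""
    return math.exp(sum(weight * count(world) for weight, count in formulas))


def probability(query, evidence=lambda world: True, formulas=FORMULAS):
    """P(query | evidence), computed by enumerating every possible world."""
    numerator = 0.0
    denominator = 0.0
    for world in all_worlds():
        if evidence(world):
            mass = unnormalized(world, formulas)
            denominator += mass
            if query(world):
                numerator += mass
    return numerator / denominator


p = probability(
    query=lambda w: w[("cancer", "anna")],
    evidence=lambda w: w[("smokes", "anna")],
)
print(f"P(cancer(anna) | smokes(anna)) = {p:.3f}")
```

With a weight of zero the formula has no effect and the conditional probability is 0.5. Raising the weight pushes the model toward worlds where the formula holds, without ever making it a hard constraint. Learning consists of choosing the weights (and possibly the formulas) from data, while the formulas themselves stay readable.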
