A gentle introduction to statistical relational learning: maths, code, and examples

Tweet

2015.07.13

Statistical relational learning is a branch of machine learning (A.I.) devoted to unify probability theory and logic. I'll write another post later to explain the motivation and a bit of history of this fascinating branch of study, but here I want to focus on a concrete example, with detailed maths and code.

The approach to statistical relational learning explained here is called Markov logic network (MLN), discovered in 2006 by Richardson and Domingos. Their paper has a nice simple example of MLN applied to the relationship between smoking and cancer. However, it's a bit hard to follow unless you're used to read papers on both logic and probabilistic graphical models. In this post, I will mostly follow their smoking/cancer example, but I will try to be much more explicit. I'll also do a demonstration with Faun, a small implementation I wrote for playing with statistical relational models.

A Markov logic network is simply a set of formulas written in first-order logic, each associated with a weight. We'll use this for our examples:

Statement Weight
Smoking causes cancers 1.5
If two people are friends and one smokes, then so does the other 1.1

Using a more formal representation, with the weight following the first-order logic formula, we get:

\[\forall x: Smoking(x) \Rightarrow Cancer(x), 1.5;\] \[\forall x, y: Friend(x, y) \land Smoking(x) \Rightarrow Smoking(y), 1.1;\]

And that's our Markov logic network. If you don't know much about first-order logic, I wrote a short introduction here that is more than enough to understand Markov logic networks.

The grand idea of statistical relational learning is that, in pure logic, a world is false if it violates a single formula, but with Markov logic networks, a world is less likely if it violates formulas, especially if it violates a formula with a high weight. Statistical relational learning has important advantages on probabilistic approaches: a first-order logic formula is simple to understand and interpret, plus it can be manipulated by humans and computers in ways a naked probabilistic model can't. I think the greatest advantage of Markov logic networks, and statistical relational approaches in general, is to go beyond data models to form a general knowledge base. Just like old expert systems, but without the inflexibility of logic.

But enough with the background, let's look at an example.

From Markov logic networks to inference in Markov networks

The first odd thing with Markov logic networks, is that there's no network (yet). All we got is a set of weighted first-order logic formulas. A Markov logic network can be seen as a template for Markov networks. That is: there is no network in the Markov logic network, but we'll use it to generate networks. Take the formula:

\[\forall x: Smoking(x) \Rightarrow Cancer(x), 1.5;\]

x is a variable, and to get a concrete model for inference from this template, we need to apply constants (real objects) to the MLN so we can replace these variables. So in essence:

\[\mbox{MLN (weighted formulas) + Constants} \rightarrow \mbox{Markov Network}\]

If you're confused, don't worry, it'll become clear with examples. Let's say we are interested in the relationship between smoking/cancer/friendship for three constants: Jerry, Elaine, George. We can apply these constants to the formula to get a set of ground formulas. The term ground means the variables are replaced by constants, that is, concrete objects. For example in mathematics you probably encountered first-order logic formulas like:

\[\forall x: Add(x, 0) = x.\]

We can ground this first-order logic formula by replacing x with an actual integer:

\[Add(47, 0) = 47.\] \[Add(1729, 0) = 1729.\]

Similarly, applying the set of constants \(\{Elaine, Jerry, George\}\) to our first formula yields a set of ground formulas:

\[Smoking(Elaine) \Rightarrow Cancer(Elaine), 1.5;\] \[Smoking(Jerry) \Rightarrow Cancer(Jerry), 1.5;\] \[Smoking(George) \Rightarrow Cancer(George), 1.5;\]

We could do the same thing with the other formula, which has two variables:

\[Friend(Elaine, Elaine) \land Smoking(Elaine) \Rightarrow Smoking(Elaine), 1.1;\] \[Friend(Elaine, Jerry) \land Smoking(Elaine) \Rightarrow Smoking(Jerry), 1.1;\] \[Friend(Elaine, George) \land Smoking(Elaine) \Rightarrow Smoking(George), 1.1;\] \[Friend(Jerry, Elaine) \land Smoking(Jerry) \Rightarrow Smoking(Elaine), 1.1;\] \[Friend(Jerry, Jerry) \land Smoking(Jerry) \Rightarrow Smoking(Jerry), 1.1;\] \[Friend(Jerry, George) \land Smoking(Jerry) \Rightarrow Smoking(George), 1.1;\] \[Friend(George, Elaine) \land Smoking(George) \Rightarrow Smoking(Elaine), 1.1;\] \[Friend(George, Jerry) \land Smoking(George) \Rightarrow Smoking(Jerry), 1.1;\] \[Friend(George, George) \land Smoking(George) \Rightarrow Smoking(George), 1.1;\]

We'll get our network from these ground formulas. From the groundings of the two formulas, we get a set of predicates:

\[\{Smoking(Elaine), Smoking(Jerry), Smoking(George), Cancer(Elaine),\] \[Cancer(Jerry), Cancer(George), Friend(Elaine, Elaine), Friend(Elaine, Jerry),\] \[Friend(Elaine, George), Friend(Jerry, Elaine), Friend(Jerry, Jerry),\] \[Friend(Jerry, George), Friend(George, Elaine),\] \[Friend(George, Jerry), Friend(Jerry, George)\}\]

Markov logic networks provide the structure to answer question of the type

\[P(Cancer(George) \mid Smoking(Jerry), Friend(Jerry, George)),\]

where our variables (in the probabilistic sense) are the ground predicates. Of course, we could generate a completely different set of ground formulas and ground predicates if we apply, say, the constants \(\{William, Anastasia, Kara, Saul, Karl, Tory, Felix, Laura\}\), or any number of objects we're interested in.

To finally see the network and answer probabilistic queries, we'll create one node for each ground predicate, and link all nodes that are in the same ground formula. In our example with the constants \(\{Elaine, Jerry, George\}\), we get the following Markov network:

ground network

That said, to make it easier to follow, we'll generate a simpler network by applying only two constants to the same two formulas: \(\{Kara, Lee\}\). We get:

Kara Lee network

To understand how to query the network, it's easier to focus on the factor graph. In the factor graph, there is a factor for all ground formulas, and all the ground predicates found in the ground formula are linked to the factor:

Kara Lee network

By applying \(\{Kara, Lee\}\) to our two formulas we got the ground formulas

\[Smoking(Kara) \Rightarrow Cancer(Kara), 1.5;\] \[Smoking(Lee) \Rightarrow Cancer(Lee), 1.5;\] \[Friend(Kara, Kara) \land Smoking(Kara) \Rightarrow Smoking(Kara), 1.1;\] \[Friend(Kara, Lee) \land Smoking(Kara) \Rightarrow Smoking(Lee), 1.1;\] \[Friend(Lee, Kara) \land Smoking(Lee) \Rightarrow Smoking(Kara), 1.1;\] \[Friend(Lee, Lee) \land Smoking(Lee) \Rightarrow Smoking(Lee), 1.1;\]

And you can see all the six ground formulas have a corresponding factor (the squares) in the graph. From there, if we want to compute \(P(X = x)\), we use the equation

\[P(X = x) = \frac{1}{Z}\exp\left(\sum_i w_ig_i(x) \right),\]

where \(w_i\) is the weight of the formula associated with the ith factor, \(g_i\) equals 1 if the formula is true given the values of the predicates or 0 if it's false, and \(Z\) is a normalizing constant, that is: it's the sum of the values of all possible assignments:

\[Z = \sum_{x' \in \ \mathcal{X}}\exp\left(\sum_i w_ig_i(x') \right),\]

with \(\mathcal{X}\) being the set of possible assignments. Now, let's try to compute the probability of Kara and Lee being mutual friend, neither of them being friend with themselves, Kara smokes and has cancer, but Lee neither smokes nor has cancer. We'll get the following network (green = true predicates, red = false predicates):

Kara Lee network

The value of the predicates can then be used to determine the factors that are true (worth 1, in green) and those that are false (worth 0, in red):

Kara Lee network

If you are confused about how factors are resolved, it's probably because you misinterpret the logic symbol implies \(\Rightarrow\), check here for clarifications. For example, the factor for \(Smoking(Lee) \Rightarrow Cancer(Lee)\) is true, because \(False \Rightarrow x\) returns true regardless of whether x is true or false.

With the factors resolved, we can compute the probability:

\[\frac{1}{Z}\exp\left(1 \times 1.1 + 1 \times 1.1 + 1 \times 1.1 + 0 \times 1.1 + 1 \times 1.5 + 1 \times 1.5\right),\] \[\frac{1}{Z}\exp\left(3 \times 1.1 + 2 \times 1.5\right),\] \[\frac{544.57}{Z}.\]

The normalizing constant is \(Z = 229210.5024\), so our probability is:

\[\frac{544.57}{229210.5024} = 0.0023759.\]

If we were to flip Cancer(Kara) to false, we'd get a lower probability (because smoking causes cancer):

\[\frac{1}{Z}\exp\left(3 \times 1.1 + 1 \times 1.5\right) = \frac{121.51}{229210.5024} = 0.00053012.\]

Code Example

In this section we'll perform an exact inference on a few MLNs. If you want to follow the examples, you can copy the library with:

$ git clone https://github.com/PhDP/Faun.git
$ cd Faun
$ cabal install

The code is tested on both Linux and Windows, and it should work fine on OSX with the Haskell platform installed. Right now, it performs only exact inference, which is useful for tests... but it does not scale since the Markov networks generated by Markov logic are humongous even with a few constants. Hopefully there are many good algorithms for inference, but for now they are not sufficiently tested in Faun. Anyway, exact inference will be enough for exploring a few models.

First, we launch an interactive console from the root of the code:

$ cabal repl

Then, we'll load the Markov logic network module:

ghci> import Faun.MarkovLogic

The most straightforward way to build a Markov logic network is with fromStrings. This function takes an array of strings, each of which must be a valid first-order logic formula followed (or preceded) by a number (the weight of the formula). In our smoking example we have:

ghci> let mln = fromStrings ["∀x Smoking(x) ⇒ Cancer(x) 1.5", "∀x∀y Friend(x, y) ∧ Smoking(x) ⇒ Smoking(y) 1.1"]

The strings were copy-pasted from Richardson and Domingos' paper, but the parsers is flexible and will accept a keyboard-friendly form too:

ghci> fromStrings ["1.5 forall x Smoking(x) implies Cancer(x)", "1.1 forall x, y Friend(x, y) and Smoking(x) implies Smoking(y)"]

We can see the structure of the network simply by typing its name:

ghci> mln
1.5                     ∀x Smoking(x) ⇒ Cancer(x)
1.1                     ∀x ∀y Friend(x, y) ∧ Smoking(x) ⇒ Smoking(y)

Since a Markov logic network is a template for Markov networks. To get a Markov network, we need to apply a set of constant to the Markov logic networks Following our example in the last section we'll use:

ghci> let cs = ["Elaine", "George", "Jerry"]

Then, we can query the network, say, what is the probability that Jerry has cancer?

ghci> ask mln cs "P(Cancer(Jerry))"
Just 0.6040175344121184

The function ask takes a Markov logic network, a list of terms (represented as a list of strings), and a string query. It will return Just P, with P being a probability in the [0.0, 1.0] range, or Nothing if the parser fails to read the query. To make the process a bit easier we'll create a function query with the first two arguments supplied so we don't need to repeat them ad nauseam:

ghci> let query = ask mln cs

Then we can ask queries with this function:

ghci> query "P(Cancer(Jerry), Cancer(Elaine))"
Just 0.3696834237837972
ghci> query "P(Cancer(Jerry) | Smoking(Jerry))"
Just 0.8175744761936782

While the formulas look distinct in the Markov logic network, the fact that they share predicates link them. So, even if there is no direct relationship between friendship and cancer, we have:

ghci> query "P(Cancer(Jerry) | Smoking(Elaine))"
Just 0.6506081590969498
ghci> query "P(Cancer(Jerry) | Smoking(Elaine), Friend(Elaine, Jerry))"
Just 0.7043948532279771

...just as expected, because friends tend to smoke, so Elaine being Jerry's friend increases the chance that Jerry is smoking, and if we know he doesn't we'll get:

ghci> query "P(Cancer(Jerry) | Smoking(Elaine), Friend(Elaine, Jerry), !Smoking(Jerry))"
Just 0.4999999999999964
ghci> query "P(Cancer(Jerry) | !Smoking(Jerry))"
Just 0.5000000000000238

The cool thing with all of this is that we already have a rich structure for inference from just two simple logical formulas, weights, and a list of objects to ground the formulas.

We can add a logic formula to the network with the tell function. It takes a string (just like fromStrings takes a list of strings), an existing Markov logic network, and will return a new Markov logic network with the formula added. Let's say we want to add a rule that friends of friends are friends, we could add this rule with a weight of 2.0 with:

ghci> let mln' = tell "2.0 A.x,y,z Friend(x, y) and Friend(y, z) => Friend(x, z)" mln
ghci> mln'
1.5                     ∀x Smoking(x) ⇒ Cancer(x)
2.0                     ∀x ∀y ∀z Friend(x, y) ∧ Friend(y, z) ⇒ Friend(x, z)
1.1                     ∀x ∀y Friend(x, y) ∧ Smoking(x) ⇒ Smoking(y)

We'll build an ask function for this network using the same constants (Jerry, Elaine, George):

ghci> let query' = ask mln' cs

Let's compare how our first MLN behaved compare to our new one:

ghci> query "P(Smoking(George) | Smoking(Jerry), Friend(Jerry, Elaine), Friend(Elaine, George))"
Just 0.5877406718485353
ghci> query' "P(Smoking(George) | Smoking(Jerry), Friend(Jerry, Elaine), Friend(Elaine, George))"
Just 0.7080082227672424

Same query, same information, but with a different model (the MLN), we get different answers. Now, something is bugging me with the original model:

ghci> query "P(Cancer(Jerry) | Smoking(George))"
Just 0.6506081590969498
ghci> query "P(Cancer(Jerry) | Smoking(George), Friend(George, Jerry))"
Just 0.7043948532279771
ghci> query "P(Cancer(Jerry) | Smoking(George), Friend(Jerry, George))"
Just 0.6506081590968945

The problem is that friendship is assymmetricial in this MLN. If we know George is friend with Jerry, there's a very good chance that Jerry is friend with George, and thus influenced by his smoking habit. The nice thing with Markov logic is that all formulas are connected, so we can fix this issue by adding a formula, our new Markov logic network is:

\[\forall x: Smoking(x) \Rightarrow Cancer(x), 1.5;\] \[\forall x, y: Friend(x, y) \land Smoking(x) \Rightarrow Smoking(y), 1.1;\] \[\forall x, y: Friend(x, y) \iff Friend(y, x), 2.0;\]

The last formula has the \(\iff\) operator, which is true either when both sides are true, or when both sides are false. Thus, this formula says: if \(x\) is friend with \(y\), then \(y\) is friend with \(x\), and if \(x\) is not friend with \(y\), then \(y\) is not friend with \(x\). Since it's probabilistic, it doesn't need to be true all the time, but I give it a fairly high weight. And now we have:

ghci> let mln'' = tell "ForAll x, y: Friend(x, y) iff Friend(y, x) 2.0" mln
ghci> mln''
1.5                     ∀x Smoking(x) ⇒ Cancer(x)
1.1                     ∀x ∀y Friend(x, y) ∧ Smoking(x) ⇒ Smoking(y)
2.0                     ∀x ∀y Friend(x, y) ⇔ Friend(y, x)
ghci> let query'' = ask mln'' cs
ghci> query'' "P(Cancer(Jerry) | Smoking(George))"
Just 0.6506081590969105
ghci> query'' "P(Cancer(Jerry) | Smoking(George), Friend(George, Jerry))"
Just 0.7043948532279295
ghci> query'' "P(Cancer(Jerry) | Smoking(George), Friend(Jerry, George))"
Just 0.7018023327893508

Again, with the same information, the same query, just by adding a simple formula we were able to get a richer model. Of course, there's much more to Markov logic networks than simple inference, we can learn new formulas from data, combine existing knowledge bases, transfer knowledge between domains...

let world = "世界" in print $ "Hello " ++ world ++ "!"