The Machine Learning Journey: 2014

Well, first of all, this post is a bit rantish, but after talking with some people, it seemed only fair to explicitly give examples of why I think Bishop's Pattern Recognition and Machine Learning offers an overall better learning experience than Murphy's Machine Learning: A Probabilistic Perspective.

I'll present two didactic experiments, and I will take the point of view of someone versed in Probability and Statistics at an undergrad level, but not so much in ML.

So the first experiment is to introduce the reader to the EM algorithm via Gaussian Mixture Models (GMM). Every book does it, and every book has its strengths and weaknesses.

Just so you can follow, this topic starts on page 352 in Murphy's book and on page 430 in Bishop's.

Ok, so to start off, Murphy never references the equation of the Mixture of Gaussians (granted, it is 5 pages before, but still, you need to keep a finger in that page and go back and forth). Bishop, on the other hand, essentially restates the whole Mixture of Gaussians paradigm, which he already covered 200 pages before. It is essentially the same text, but he goes to the trouble of restating the notation, what each term means and how the likelihood is calculated. He does mention that he already did it 200 pages before (by putting a reference to the equation), but he goes ahead anyway. Murphy does not even go to the trouble of saying where the likelihood equations are.
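
For reference, the equations in question are short. What follows is the standard Mixture of Gaussians density and its log-likelihood, written in commonly used notation. The symbols are my assumption, not a transcription of either book: $\pi_k$ are the mixing weights, $\mu_k$ and $\Sigma_k$ are the mean and covariance of component $k$, and $x_1, \dots, x_N$ are the data points.

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$

Two lines plus one sentence of definitions is all a reader needs to have in front of them before the EM discussion starts.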

Furthermore, Bishop uses GMM as a motivation for EM, while Murphy follows the age-old formula of Model -> Numerical Recipe to solve it. And beware: the numerical recipe explains none of its notation, so you have to go through the whole book looking for it.
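
To make the "recipe" concrete, here is a minimal sketch of EM for a one-dimensional Mixture of Gaussians with every symbol named in the code. It uses the standard textbook updates and is not copied from either book. The function names (`normal_pdf`, `log_likelihood`, `em_gmm_1d`), the quantile-based initialisation, the variance floor and the toy data are all my own assumptions, chosen for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Sum over points of log( sum_k weight_k * N(x | mean_k, var_k) )."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_gmm_1d(data, k=2, iters=50):
    """Fit a k-component 1-D GMM with plain EM.

    Returns (weights, means, variances, history), where history holds the
    log-likelihood before the first iteration and after each one.
    """
    n = len(data)
    # Initialisation (an arbitrary, hypothetical choice): means at evenly
    # spaced quantiles, the overall variance for every component, and
    # uniform mixing weights.
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    history = [log_likelihood(data, weights, means, variances)]

    for _ in range(iters):
        # E step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])

        # M step: re-estimate parameters from responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of points
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj
            variances[j] = max(variances[j], 1e-6)  # crude guard against collapse
            weights[j] = nj / n

        history.append(log_likelihood(data, weights, means, variances))

    return weights, means, variances, history


# Toy data: two well-separated synthetic clusters, for illustration only.
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(5.0, 1.0) for _ in range(200)])
weights, means, variances, history = em_gmm_1d(data, k=2)
```

The point of the sketch is didactic: the responsibilities, the mixing weights and the log-likelihood all sit within a screen of each other, so nothing has to be looked up 200 pages earlier. It does no log-space arithmetic, so it would need hardening (for example against underflow) before being used on real data.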

So for the first experiment, Bishop is the clear winner.

Second experiment: introduce the reader to a new topic that is not in Bishop's (because that is supposedly one of Murphy's advantages). In this case it is Deep Learning, which is the last chapter of the book, page 1000.

Ok, so we are presented with equation 28.1:

$p(h_1,h_2,h_3,v|\theta) = \cdots$

Without any accompanying text, that equation is just useless. What are $h_1, h_2, h_3$? On the right-hand side of the equation they are multiplied by some $w$. Remember, I am a newbie in ML, so I am not supposed to know directed graphical models at all. Furthermore, there is no explanation of or motivation for the model.

A good reference book would at least direct you to where those terms were first introduced. So I went to the Notation section where, oh surprise, $h$ is never explained, and neither is $v$. I mean, there is a $v$ for the nodes of a graph, and this is a directed graphical model, so that makes sense.

So $v$ is any node in the graph. Just to be sure, I will search in the book. I assume that the previous chapter (latent variable models for discrete data, page 950) has the notation at the beginning, since by definition a deep net might be a latent variable model for discrete data (here I had to draw a bit on my ML expertise). Remember, we still have no idea what $h$ or $w$ stands for.

Ok, so the first thing I notice is that $v$ is actually the words in a document, so it is not nodes, but words? I do not do NLP; what are the words in a Bag of Words supposed to be? Features? Examples?
Why words, why not something more general?

But at least we know $v$ are words. We still do not know what $h$ or $w$ are.

So perhaps it is even further back.

Graphical Models Structure Learning (page 909): obviously, no notation in the first few pages. Let's go over the chapter, I think I saw an $h$. Yeii!! Success!! Page 924 defines $h$ as hidden variables.

So $v$ are words and $h$ are hidden variables, right? Ok cool, wait.... I just bumped into page 988, and here it says let's overwrite the notation (just because, I guess), and now $v$ is actually the input (which are the visible layers by the way, hence the $v$). So $v$ went from being nodes, to words, to the input formerly known as $y$ (like Prince!!).

Now we understand the left-hand side of equation 28.1. Oh yeah, we haven't defined the weights yet (yeah, those pesky $w$, which I can define in a single line), but I'm too tired to even try looking. They are weights that mark the importance of each hidden or visible node, and they are the things we are trying to find (there! Print it and paste it at the top of the book, a single line!!!).

We essentially backtracked two chapters and wasted a ton of time looking for a single line where he defined the variables for the first time, and if you missed it, like I did with page 988, you are essentially doomed to have a bad understanding of the topic.

Anyway, I guess that is off my chest, and I promise this is the last post on Murphy's book. I guess my take-home message is: this is probably a good book if you already know machine learning (and are familiar with its particular flavor of notation), but it is by no means an "essential reference for practitioners of modern machine learning". The last thing a practitioner wants is to go through the entire book just to implement EM. Especially today's practitioners, but that is a rant for another day.

At the end of the day, I enjoy teaching ML to people and introducing these fascinating concepts to them, and it is very frustrating that the community endorses a book that does a very poor job of introducing them.

See ya

I was going through Murphy's Machine Learning book to refresh some ML concepts that I needed. While I only went through it quickly, and the sections I am most familiar with are well written, I found that the book feels very rushed. The online errata are huge, and each reprinting just seems to have more.

These mistakes might be passable if you are familiar with the material, but if you are learning it and taking the equations stated there at face value, you might have some issues.

Also, terminology is not clearly explained, and for the sake of saving space, he refers you back to the very first time he introduced it, which most of the time is in one of the very first chapters. I kid you not, I had to go back three times to find out he never really explained one symbol in the equation.

I mostly used Bishop's book when I started doing Machine Learning, and in contrast to Murphy's, it is a pretty self-contained book. He goes to great lengths to restate most of the nomenclature he uses throughout the book, which is useful if you don't feel like going back to the very beginning to figure out what he is talking about.

Some sections of Murphy's book, like the one on sampling, are very well explained, way better than in Bishop's or any other book. However, the sampling examples assume that the reader is familiar with his particular terminology, which just takes more time than it should.

But for an entry-level grad student, or even an undergraduate, the book is just not friendly enough. The code is not really well documented, so aside from reproducing the figures in the book, there is really not much added value in having it available. Most of the time you'll just be figuring out what the code does, and since it is not cross-referenced with the equations in the book, it is not really tractable as a learning experience.

It also has its issues as a reference book since, as I told you, it requires too much backtracking when it comes to equations.

One good thing is its explanation of Boltzmann Machines, which Bishop's book really lacks, mostly because Murphy's was written after the deep net boom.

Wednesday, September 24, 2014

Why I just do not think Murphy's book is that good (Part 2, a case in point)

Wednesday, July 30, 2014

Why I wouldn't use Murphy's book to teach a Machine Learning Class