Monday, February 27, 2012

An introduction to linear regression - Cost Function (ML for the Layman)

I've tried keeping away from posting tutorials on ML topics. Mainly because I did not feel well prepared to do it yet. I hardly think I'm prepared now, but I definitely can give it a better shot, and with your feedback, I can ,at least, get an idea on where I can improve.

Disclaimer: Even though this introduction will be basic, and I mean as basic as it can get. You'll still need some knowledge on matrix operations, basic algebra and calculus to get through some of the explanations.

Imagine you want to sell your car, let's say it is a Prius 2007 with 20,000 miles. It is in a very good condition and you would like to do a survey of how much does your car costs in the market. You definitively want to get, as a seller, the best price for it.

So, how do you price a car? If you know nothing about cars (like me), you can go to someone that has a better idea, in our case, it is the internet.

We can see that the price is set by things, like the age, the maker, the mileage, the overall condition, etc. We will call this things "features". So a set of features is basically the characteristics of our car or any object.

So which features to choose? For the sake of simplicity let's choose year and mileage. It's important to notice that there are whole researches around how to choose features, but we are first learners here, so we do not care about that. It is important to know, however, that having many features is not always better than having few features.

Now, that you have chosen a set of features, how do you compare one car to another? We now go and look for data. For car comparison, we can go to different web sites where you can see different combinations of these features for different cars. Your local newspaper's classifieds or craiglist.

Let's create a mock data set of 5 cars using only 2 features, age and mileage. All of them are Prius:

$mileage=(50000, 60000,54000,40000,20000)$

It is intuitive to think that the price of our car will be directly related to age and mileage. A low mileage and a recent year increases the price, while an old car with a high mileage has a lower price. So there should be an abstract way to write this relation.

To model this kind of data, we use linear regression, which states that a variable is the resutl of a linear combination of other variables. That is, our price actually obeys to some combination of mileage and year.

For the first element (12,000 USD for a 2007 model with 50,000 miles):


A second characteristic, is that every element has to share those $a_1$ and $a_2$ variables, so we can use those values with our own Prius, there would be no point in describing individual values for each car, when we want to find the best price for our car, so we can write a general equation:


How do we find the $a_1$ and $a_2$ that solve this model? How do I know that "1" and "1" are not good choices? Well, for starters, summing up the year and the mileage does not seem like a good idea.

We need some function that'll tell us how bad or how good our prediction is. This is usually called the "cost" function. How do we build a cost function, well, our intuition tells us that we have to compare (rest) the truth with our guess. So for our guess of "1" and "1" for $a_1$ and $a_2$:


We can see there is something off here, since we are looking for the minimum cost, we can make the values in $a$ large enough and we will have incredibly low values (negative values), so we squared them to have a nice function that has a lower bound (it does not go lower than certain value).


We now have an intuition that the best we can do is $cost=0$, since a squared number can never be negative, we cannot do any better than that.

This cost, however, is only the cost for 1 example, we need the cost for all our cars. So we sum all of them:

$Total cost^2=\sum_{i=1}^5(Price_i-(1\times Mileage_i+1\times year_i))^2$

We now have the intuition that our choice of 1 and 1 may not be a very good choice at all, just for kicks, lets put how much the cost would be for different values for $a_1$ and $a_2$:

a1,a2 Cost
0,1 675,707,259
1,0 6,595,340,000
1,1 7,252,615,259

The costs are terrible! There is, however, something happening for "0,1", since the cost decreased compared to "0,0". But there has to be something better than just random guessing right?

Our guess of 1,1 is just terrible. But we gained an intuition, we see that our solution has to be near the "0,1". Also, some of you may notice by now that no matter which values do we put, there is no way we are going to get a zero. It's just not possible. Not without some extra help, which we will call an offset.

Next time, we'll talk about optimization or how to stop you from trying everypossible combination by hand, how to use the offset and a bit about comparing pears and apples.

See you later

Remember to visit my webpage And if you want to keep up with my most recent research, you can tweet me at @leonpalafox.
Also, check my Google+ account, be sure to contact me, and send me a message so I follow you as well.