Tuesday, September 1, 2009

Least Squares Approximation

Suppose you are in a science class and you receive these instructions:

Find the temperature of the water (in degrees Celsius) at the times 1, 2, 3, 4, and 5 seconds after you have applied heat to the container. Conduct your experiment carefully. Graph each data point with time on the x-axis and temperature on the y-axis. Your data should follow a straight line. Find the equation of this line.

The data from the experiment looks like this when charted and graphed:

    time x (s):          1   2   3   4   5
    temperature y (°C):  2   3   7   8   9
Notice that our data points don't fall exactly on a straight line as they were supposed to, so how are we going to find the slope and intercept of the line?

This is a common problem in the experimental sciences because the data points that we measure seldom fall exactly on a straight line. Therefore, scientists try to find an approximation; in this case, they would try to find the line that best fits the data in some sense. The first problem is to define "best fit." It is convenient to define an error as the distance from the actual value of y for a given x (the value that was measured in the experiment) to the predicted value of y for that x. It then seems reasonable that the "best fit" line should somehow minimize the errors, but how?

You could minimize the sum of the absolute values of the errors; this is called the l1 fit. It would also be reasonable to find the biggest error for each candidate line and choose the line that minimizes this quantity; this is called the l-infinity fit. However, the fit that is used most often is the l2 fit, which is called the least squares fit because it finds the line that minimizes the sum of the squares of the errors. Gauss developed this method when he was a young man (about the age of a high-school senior). The least squares fit is the one most often used because it is the only one of the three that can be found by solving a system of linear equations.
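To make the three notions of error concrete, here is a small sketch (in Python, which is not part of the original text) that totals the errors of one candidate line, y = 1 + x, against the data from our experiment in each of the three ways:

```python
# Comparing the l1, l-infinity, and l2 measures of error for the
# candidate line y = 1 + x against the experimental data.
xs = [1, 2, 3, 4, 5]          # time in seconds
ys = [2, 3, 7, 8, 9]          # measured temperature in degrees Celsius

errors = [y - (1 + x) for x, y in zip(xs, ys)]   # actual minus predicted

l1_error   = sum(abs(e) for e in errors)   # the l1 fit minimizes this sum
linf_error = max(abs(e) for e in errors)   # the l-infinity fit minimizes this
l2_error   = sum(e * e for e in errors)    # least squares minimizes this

print(l1_error, linf_error, l2_error)      # 9 3 27
```

Each measure ranks candidate lines differently; the rest of this chapter works with the l2 (least squares) measure.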

You have just read a lot of new information, so let's illustrate the concepts with our example. We have the graph of the data above. Now we need to guess which line best fits our data. If we assume that the first two points are correct and choose the line that goes through them, we get the line y = 1 + x. If we substitute our points into this equation, we get the following chart (the error is the measured y minus the predicted y):

    x    y    1 + x    error    error²
    1    2      2        0        0
    2    3      3        0        0
    3    7      4        3        9
    4    8      5        3        9
    5    9      6        3        9
Therefore, the sum of the squares of the errors is 27. Do you think that we can do better than this?

If we choose the line that goes through the points at x = 3 and x = 4, we get the line y = 4 + x. Will we get a better fit? Let's look at it:

    x    y    4 + x    error    error²
    1    2      5       -3        9
    2    3      6       -3        9
    3    7      7        0        0
    4    8      8        0        0
    5    9      9        0        0
The sum of the squares of the errors is 18. That is a better fit, but can we do even better?

Let's try the line that is halfway between these two lines. The equation would be y = 2.5 + x. It looks like this:

    x    y    2.5 + x    error    error²
    1    2      3.5      -1.5     2.25
    2    3      4.5      -1.5     2.25
    3    7      5.5       1.5     2.25
    4    8      6.5       1.5     2.25
    5    9      7.5       1.5     2.25
The sum of the squares of the errors is 11.25 with this line, so this is the best line yet. Can we do better? It doesn't seem very scientific or efficient to keep guessing at which line will give the best fit. Surely there is a methodical way to determine the best fit line. Let's think about what we want.
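The guessing we have been doing can at least be scored automatically. As a sketch (Python, with a helper name of our own choosing), a small function that computes the sum of squared errors lets us compare any candidate lines:

```python
# Score any candidate line y = intercept + slope*x against the data.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 7, 8, 9]

def sum_squared_errors(intercept, slope):
    """Sum of squared errors for the candidate line y = intercept + slope*x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

print(sum_squared_errors(1.0, 1.0))   # 27.0   (y = 1 + x)
print(sum_squared_errors(4.0, 1.0))   # 18.0   (y = 4 + x)
print(sum_squared_errors(2.5, 1.0))   # 11.25  (y = 2.5 + x)
```

Scoring guesses is easy; the point of the rest of the chapter is that we can find the minimizing line directly, without guessing at all.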

A line in slope-intercept form looks like c0 + c1x = y, where c0 is the y-intercept and c1 is the slope. We want to find c0 and c1 such that c0 + c1xi = yi is true for all our data points:

c0 + 1c1   =   2
c0 + 2c1   =   3
c0 + 3c1   =   7
c0 + 4c1   =   8
c0 + 5c1   =   9

We know that there may not exist c0 and c1 that fit all these equations, so we try to find the best fit. We can write these equations in the form Xc = y (these are just new letters for our familiar equation Ax = b) where

    X = [ 1  1 ]      c = [ c0 ]      y = [ 2 ]
        [ 1  2 ]          [ c1 ]          [ 3 ]
        [ 1  3 ]                          [ 7 ]
        [ 1  4 ]                          [ 8 ]
        [ 1  5 ]                          [ 9 ]
In general, we cannot solve this system exactly because it is usually inconsistent: it is overdetermined, meaning we have more equations than unknowns (the unknowns are the two variables, c0 and c1, for which we are trying to solve). There is a system of equations, called the normal equations, that can be used to find the least squares solution to systems with more equations than unknowns.
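One way to see the inconsistency concretely (a sketch in Python with NumPy, not part of the original text): a system Xc = y is consistent exactly when the rank of X equals the rank of the augmented matrix [X | y], and for our data it does not:

```python
import numpy as np

# Our overdetermined system: five equations, two unknowns.
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([2, 3, 7, 8, 9], dtype=float)

# Xc = y has an exact solution only if y lies in the column space of X,
# i.e. only if augmenting X with y does not increase the rank.
rank_X   = np.linalg.matrix_rank(X)
rank_aug = np.linalg.matrix_rank(np.column_stack([X, y]))
print(rank_X, rank_aug)   # 2 3 -> inconsistent, no exact solution
```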

Theorem 8.1 Let X be an m by n matrix such that XTX is invertible. Then the solution to the normal equations, XTXc = XTy, is the least squares approximation to c in Xc = y.

Remark 25 It is important to remember that the solution to the normal equations is only an approximation to c for Xc = y. It is not equal to c because Xc = y is inconsistent, so it has no solution. In other words, there does not exist a vector, c, that makes Xc = y a true statement. Therefore, we use the normal equations to approximate c.

Remark 26 For now, you don't need to check to see if XTX is invertible because most of the systems that we encounter will meet this requirement. However, if you cannot find a solution to the normal equations, you should check to see if XTX is invertible.

The normal equations will give us the "best fit" line (or curve) every time according to the way we defined "best fit." The proof of this is at the end of this chapter. Let's try applying the normal equations to our system. First, we multiply so that we have a system XTXc = XTy that we can solve:

    [  5  15 ] [ c0 ]   [  29 ]
    [ 15  55 ] [ c1 ] = [ 106 ]
Now we can work with the augmented matrix and use Gauss-Jordan elimination to find the solution of the normal equations. This solution will be the coefficients of the line which give the best fit in the least squares sense.

    [  5  15 |  29 ]        [ 1  0 | 0.1 ]
    [ 15  55 | 106 ]   →    [ 0  1 | 1.9 ]

so c0 = 0.1 and c1 = 1.9.
When we graph and chart the line y = 0.1 + 1.9x, we get:

    x    y    0.1 + 1.9x    error    error²
    1    2       2.0          0.0     0.00
    2    3       3.9         -0.9     0.81
    3    7       5.8          1.2     1.44
    4    8       7.7          0.3     0.09
    5    9       9.6         -0.6     0.36
The sum of the squares of the errors is 2.7. This is a great improvement over our guesses, and we know that we cannot do any better. In general, if we have n data points, we solve XTXc = XTy with

    X = [ 1  x1 ]      c = [ c0 ]      y = [ y1 ]
        [ 1  x2 ]          [ c1 ]          [ y2 ]
        [ ⋮   ⋮ ]                          [  ⋮ ]
        [ 1  xn ]                          [ yn ]

The ellipsis marks (written as vertical dots ⋮ or …) tell you to continue in the same pattern.
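The hand computation above can be reproduced in a few lines. This sketch (Python with NumPy; not part of the original text) builds X and y from our data, solves the normal equations, and checks the sum of squared errors:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.0, 3.0, 7.0, 8.0, 9.0])

# X has a column of ones (for c0) and a column of x-values (for c1).
X = np.column_stack([np.ones_like(xs), xs])

# Solve the normal equations X^T X c = X^T y.
c = np.linalg.solve(X.T @ X, X.T @ ys)
print(c)                          # approximately [0.1, 1.9]

sse = np.sum((ys - X @ c) ** 2)
print(sse)                        # approximately 2.7
```

np.linalg.solve performs the same elimination we did by hand on the augmented matrix, just in floating point.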

What if we are told that our data is not supposed to fit a straight line, but instead falls in the shape of a parabola? Consider the following data from another experiment:

    x:   -1    0    1    2    3
    y:    3    1   -1    1    3
We can find the curve that best fits our data in a similar manner. The general equation for a parabola is c0 + c1x + c2x² = y. Therefore, we want to find the values of the coefficients, c0, c1, and c2, so that the curve we find best fits these equations:

c0 - 1c1 + 1c2   =   3
c0 + 0c1 + 0c2   =   1
c0 + 1c1 + 1c2   =   -1
c0 + 2c1 + 4c2   =   1
c0 + 3c1 + 9c2   =   3

Let us use the normal equations with

    X = [ 1  -1  1 ]      c = [ c0 ]      y = [  3 ]
        [ 1   0  0 ]          [ c1 ]          [  1 ]
        [ 1   1  1 ]          [ c2 ]          [ -1 ]
        [ 1   2  4 ]                          [  1 ]
        [ 1   3  9 ]                          [  3 ]

Multiplying out gives the system XTXc = XTy:

    [  5   5  15 ] [ c0 ]   [  7 ]
    [  5  15  35 ] [ c1 ] = [  7 ]
    [ 15  35  99 ] [ c2 ]   [ 33 ]
Now we can augment the matrix and solve using Gaussian elimination.

    [  5   5  15 |  7 ]        [ 5   5  15 |  7 ]
    [  5  15  35 |  7 ]   →    [ 0  10  20 |  0 ]
    [ 15  35  99 | 33 ]        [ 0   0  14 | 12 ]
Back-substitution yields the coefficients

    c2 = 12/14 = 6/7
    c1 = (0 - 20c2)/10 = -12/7
    c0 = (7 - 5c1 - 15c2)/5 = 19/35
These coefficients indicate that the curve we want is y = 19/35 - (12/7)x + (6/7)x². Let's graph this curve and fill in our chart:

    x     y    predicted y    error     error²
    -1    3      109/35        -4/35     16/1225
     0    1       19/35        16/35    256/1225
     1   -1      -11/35       -24/35    576/1225
     2    1       19/35        16/35    256/1225
     3    3      109/35        -4/35     16/1225
We find that the sum of the squared errors is 1120/1225 = 32/35. Using our definition of least squares "best fit," you will not be able to find a parabola that fits the data better than this one. In general, to find the parabola that best fits the data, you use the normal equations XTXc = XTy with

    X = [ 1  x1  x1² ]      c = [ c0 ]      y = [ y1 ]
        [ 1  x2  x2² ]          [ c1 ]          [ y2 ]
        [ ⋮   ⋮    ⋮ ]          [ c2 ]          [  ⋮ ]
        [ 1  xn  xn² ]                          [ yn ]
Notice that the normal equations used to find the best fit line and the best fit parabola have the same form. Do you think that we could expand this to higher degree polynomials? Yes, we can. In general, we use the normal equations XTXc = XTy with

    X = [ 1  x1  x1^2  …  x1^m ]      c = [ c0 ]      y = [ y1 ]
        [ 1  x2  x2^2  …  x2^m ]          [ c1 ]          [ y2 ]
        [ ⋮   ⋮    ⋮         ⋮  ]          [  ⋮ ]          [  ⋮ ]
        [ 1  xn  xn^2  …  xn^m ]          [ cm ]          [ yn ]
where m represents the degree of the polynomial curve that you wish to fit and n represents the number of data points. The least squares "best fit" curve for these equations is c0 + c1x + c2x^2 + … + c(m-1)x^(m-1) + cmx^m. Remember that the degree is the highest power of the variable in your equation. A line is a first degree polynomial and a parabola is a second degree polynomial.
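This general recipe can be sketched as one short function (Python with NumPy; the name least_squares_poly is our own invention). It builds the matrix X of powers of x for any degree m and solves the normal equations:

```python
import numpy as np

def least_squares_poly(xs, ys, m):
    """Coefficients [c0, c1, ..., cm] of the degree-m least squares
    polynomial, found by solving the normal equations X^T X c = X^T y."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    X = np.column_stack([xs ** k for k in range(m + 1)])  # columns: 1, x, ..., x^m
    return np.linalg.solve(X.T @ X, X.T @ ys)

# Degree 1 reproduces the best fit line for the temperature data.
line = least_squares_poly([1, 2, 3, 4, 5], [2, 3, 7, 8, 9], 1)
print(line)       # approximately [0.1, 1.9]

# Degree 2 reproduces the best fit parabola for the second data set.
parabola = least_squares_poly([-1, 0, 1, 2, 3], [3, 1, -1, 1, 3], 2)
print(parabola)   # approximately [19/35, -12/7, 6/7]
```

(NumPy's own np.polyfit solves the same least squares problem with a more numerically stable algorithm and returns the coefficients in the opposite order; the function above follows the normal-equations approach of this chapter.)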

If we can find the best fit curve for any degree polynomial, why don't we always use a higher degree polynomial and fit the data better? After all, if we have n data points and fit them to a polynomial of degree n - 1, we will have a perfect fit every time because our systems would not be inconsistent. However, our goal is not just to find a curve that fits the data closely. Usually, we want the curve to predict what would happen between our data points. If we choose a curve that exactly fits all our data points, we are incorporating the error in our measurements into our model (unless the measurements happen to contain no error, which occurs only rarely). Unfortunately, there is no set rule for deciding what degree polynomial should be used to fit the data. However, first and second degree polynomials provide the simplest models and should fit most of your data until you start modeling more complicated systems.

Notice that we said we usually fit a curve so that we can predict what would happen between our data points. Predicting an outcome between data points is called interpolation. Why didn't we say anything about predicting the behavior beyond our data points? Predicting an outcome beyond the data is called extrapolation. It is usually dangerous to extrapolate much beyond the data because we have no indication that the data will continue to follow the same curve; our curve was fit only to the data we have. For example, suppose we measured the height (in inches) of a teenage boy every year for a few years and charted his growth. The growth appeared linear, so we fit a line to the data and got y = 32 + 2.25x. We have graphed the data with age on the x-axis and height on the y-axis.


If we extrapolate back several years, this young man was over two and a half feet tall when he was born. According to this model, he will never stop growing, so he will be 8 feet 4 inches tall by the time he is 30 and almost 14 feet tall by the time he is 60. Do you think that this is an accurate prediction?

If the temperature at the airport on the 4th of July was in the 90's for two years in a row, would it be reasonable to predict that the temperature in January between those years was also in the 90's? No, it would not. We have two problems with this model. One problem is that we only have 2 data points. You can always find a line that fits the two points, but there is no reason to believe that the relationship between the day of the year and the temperature is a linear relationship. Also, we didn't take into account other factors that could affect our model such as the pattern of the seasons. These are problems that can arise when you model a situation. When we start modeling situations and using least squares to make predictions, we are entering the world of statistics. That means that we must think about what the data represents rather than just apply the normal equations. There are many interesting applications of statistics that you can explore in another course. However, using matrices, you already know one way to find a "best fit" curve for your data.
