Building statistical models: Linear (OLS) regression


Every day, researchers around the world collect sample data to build statistical models that represent the real world as closely as possible, so that these models can be used to predict changes and outcomes in the real world. Some models are very complex, while others are as basic as calculating a mean score by summing several observations and then dividing that sum by the number of observations. The mean is a hypothetical value, so it is itself a simple model that describes the data.

The extent to which a statistical model (e.g. the mean score) represents the randomly collected sample is known as the FIT of the model. A well-fitted model will show only small differences from the observed data and will therefore be a good representation of reality. If such a model is used to make predictions about the real world, we can be confident that they will be accurate, because the model is close to reality.
It is therefore imperative that when we fit a statistical model to a set of sample data, the model fits the data well. A poor fit will lead to inaccurate predictions.
 
Most models we fit to sample data are linear models (we try to summarize our observed sample data in terms of a straight line); examples include mean scores, t-tests, ANOVA and regression. Linear models are based on a straight line, expressed as y = mx + b. Linearity is therefore a basic assumption, and it can easily be checked with a scatter plot.
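As an illustration, here is a minimal sketch of such a linearity check in Python with matplotlib (an alternative to SPSS); the advertising/sales figures and variable names are made up:

```python
# A minimal sketch of a linearity check via scatter plot, using matplotlib
# and made-up advertising/sales figures (both are hypothetical).
import matplotlib.pyplot as plt

advertising = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical predictor (X)
sales = [3, 5, 6, 9, 11, 12, 15, 16]        # hypothetical outcome (Y)

plt.scatter(advertising, sales)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.title("Do the points roughly follow a straight line?")
plt.show()
```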
 
While we can fit different models (e.g. a straight line or a curved line) to the same data, none of them will be 100% accurate, so we need to determine which model fits the data best. In practice, we often ignore the more complicated "non-linear" or "curvilinear" models in commercial marketing research.
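One way to compare such competing models is to fit each and see which leaves the smallest squared differences. The sketch below does this in Python with NumPy for a straight line versus a quadratic curve; the data are made up:

```python
# A minimal sketch comparing a straight-line fit with a curved (quadratic)
# fit on the same made-up data, using the sum of squared errors to judge
# which model fits best (smaller is better).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 5, 6, 9, 11, 12, 15, 16], dtype=float)

for degree, label in [(1, "straight line"), (2, "quadratic curve")]:
    coeffs = np.polyfit(x, y, degree)        # least-squares fit of this degree
    predicted = np.polyval(coeffs, x)
    sse = np.sum((y - predicted) ** 2)       # squared differences from the data
    print(f"{label}: sum of squared errors = {sse:.3f}")
```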
 
When we collect sample data from a population of interest, we want to infer certain things about that population, known as population parameters. Because these parameters are theoretical (we don't know their true values), we estimate them from our sample; the resulting estimates (e.g. regression coefficients) serve as statistical estimates of the population parameter values.
"Ordinary Least Squares" (OLS) for model fitting
To determine the best-fitted line, we could draw many lines through the data and eyeball which one describes the data best, but a much easier way is to employ a technique called "ordinary least squares" (OLS) to determine the best-fitted regression model. In a nutshell, it works as follows:
 
Our objective is to develop a regression equation that predicts an outcome variable (Y) using the equation of a straight line: Yi = b0 + b1Xi + ei. As we have several values (observations in our sample) for X (the single predictor) and Y (the outcome), the unknown parameters b1 (the gradient or slope, often called beta) and b0 (the intercept with the y-axis, often called alpha), together known as the regression coefficients, can be calculated. They are calculated by fitting the model (the straight line) to the actual sample data in such a way that the sum of the squared differences between the model line and the sample data points is minimised. In other words, we settle on the best-fitted line for the sample data using the "method of least squares", meaning the line with the smallest squared differences between the sample data points and the model line.
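For a single predictor, this least-squares calculation has a simple closed form. Here is a minimal sketch in Python with NumPy (an illustration with made-up data, not the SPSS procedure itself):

```python
# A minimal sketch of the least-squares calculation itself (NumPy, made-up
# data): choose b0 and b1 so that the sum of squared differences between
# the line and the sample points is as small as possible.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)      # predictor (X)
y = np.array([3, 5, 6, 9, 11, 12, 15, 16], dtype=float)  # outcome (Y)

# Closed-form solution: b1 = covariance(X, Y) / variance(X)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# The best-fitted line always passes through (mean of X, mean of Y)
b0 = y.mean() - b1 * x.mean()

print(f"Best-fitted line: Y = {b0:.3f} + {b1:.3f} * X")
```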
Once we have fitted our model and determined the values of b1 (slope) and b0 (Y-intercept), we can insert a value for X (predictor) and calculate the predicted value of Y (outcome). In the SPSS output, the "Coefficients" table shows the "B" value for the "Constant", which is b0, and the value for the predictor variable (the gradient) directly below it, which is b1. We can then insert any value for X (e.g. advertising spend) to calculate the outcome (e.g. sales).
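The same steps can be sketched in Python's statsmodels, whose coefficient output plays a role similar to SPSS's "Coefficients" table; the advertising/sales figures below are made up:

```python
# A minimal sketch of fitting and predicting with statsmodels; its
# coefficient output corresponds to SPSS's "Coefficients" table.
# The advertising/sales figures are hypothetical.
import numpy as np
import statsmodels.api as sm

advertising = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
sales = np.array([3, 5, 6, 9, 11, 12, 15, 16], dtype=float)

X = sm.add_constant(advertising)   # adds the column that estimates b0
model = sm.OLS(sales, X).fit()
print(model.params)                # [b0, b1] -- the "B" column in SPSS

b0, b1 = model.params
new_spend = 10.0                   # any value of X we care about
print(f"Predicted sales at spend {new_spend}: {b0 + b1 * new_spend:.2f}")
```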
 
As a reminder: to determine how good the fit of the regression model is, we need to compare the values predicted by our regression model (Ŷ) with the actual values (Yi) collected in our sample. Any point in the sample data that does not lie exactly on our fitted regression line reflects unexplained variance, meaning the variance in Y is not fully explained by the variance in X. The difference between the actual and predicted values of Y is the error, also referred to as the residual: outcome = (model) + error.
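In code, the residuals are simply the actual Y values minus the Y values the line predicts; a minimal sketch with the same made-up data:

```python
# A minimal sketch of the residuals: actual Y minus the Y predicted by the
# fitted line, i.e. error = outcome - model (same made-up data as above).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 5, 6, 9, 11, 12, 15, 16], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)   # one residual per sample point
print(residuals)                # points off the line have non-zero residuals
```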
 
By summing the squares of all these residuals, we can determine the overall goodness of fit of our model. The coefficient of determination (R²) tells us how much variance is explained by the model compared to how much variance there was to explain in the first place. Alternatively expressed, R² is the proportion of variance in the outcome variable that is shared with the predictor variable.
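A minimal sketch of that calculation, expressing R² as one minus the ratio of the residual sum of squares to the total sum of squares (same made-up data and fit as above):

```python
# A minimal sketch of R-squared: explained variance as a proportion of the
# total variance there is to explain (same made-up data and fit as above).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 5, 6, 9, 11, 12, 15, 16], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
predicted = b0 + b1 * x

ss_res = np.sum((y - predicted) ** 2)   # variance left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)    # total variance to explain
print(f"R-squared = {1 - ss_res / ss_tot:.3f}")
```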

This was OLS modeling, in a nutshell.

_________________________________________

Related Posts:
For a review of Sum of Squared Differences, see this post: Means, Sum of Squares, Squared Differences, Variance, Standard Deviation and Standard Error

Further Reading:
Equation of a Straight Line: http://www.mathsisfun.com/equation_of_line.html
_________________________________________
/zza83