Multiple Regression and Dummy Variables

Index

Multiple Regression | Dummy Variables |


The mechanics of testing the "significance" of a multiple regression model are basically the same as testing the significance of a simple regression model: we will consider an F-test, t-tests (one for each coefficient), and R-sqrd. However, unlike simple regression, where the F and t tests tested the same hypothesis, in multiple regression these two tests have different purposes. R-sqrd is still the percent of variance explained but is no longer the correlation squared (as it was in simple linear regression), and we will also introduce adjusted R-sqrd. When considering a multiple regression (MR) model, the most common order of interpretation is to first look at R-sqrd, then test the entire model with the F-test, and finally look at each coefficient individually using the t-tests.

**NOTE:** The term "significance" is a nice convenience but is very ambiguous if not properly specified. Thus when taking this class you should avoid simply saying something is significant without explaining (1) how you made that determination, and (2) what that specifically means in this case. You will see from the examples that those two things are always done. If you cannot do that, then any time you use the word "significant" you are potentially hurting yourself in two ways: (1) you won't do well on the quizzes or exams, where you have to be more explicit than simply throwing out the word "significant", and (2) you will look like a fool in the business world when somebody asks you to explain what you mean by "significant" and you are stumped. Remember, if you can't explain your results in managerial terms then you do not really understand what you are doing.

**1. The significance of the model - the F-test**

When speaking of significance, we are asking the question "Is whatever we are testing statistically different from zero?" Thus, when someone says something is significant, without specifying a particular value, it is automatically assumed to be statistically different from (i.e., not equal to) zero. If someone states that something is different from a particular value (e.g., 27), then whatever is being tested is significantly different from 27. In both cases, since no direction was stated (i.e., greater than or less than), whatever is being tested can be either above or below the hypothesized value. Therefore, unless specifically stated otherwise, the question of significance asks whether the parameter being tested is equal to zero (i.e., the null Ho), and if the parameter turns out to be either significantly above or below zero, the answer to the question "Is this parameter significant?" is yes (i.e., the null Ho is rejected).

Error df = 21, Total df = 24, SSR = 345, and SSE = 903.

Solve it and compare to the ANSWER

**2. The significance of the individual X's - the t-tests**

Our next step is to test the significance of the individual coefficients in the MR equation. We will conduct a t-test for each b associated with an X variable. Mechanically, the test statistic is the value of b1 (or b2, b3, ..., bi) over SEb1 (or SEb2, SEb3, ..., SEbi), compared to a t-critical with n - (k + 1) df, also written n - k - 1 (the error df from the ANOVA table within the MR). Alternatively, we consider the p-values to determine whether to reject or fail to reject Ho. (This is the same test as we performed in simple linear regression.) The null being tested is Bi = 0, which means this variable is not related to Y. We consider each variable separately and thus must conduct as many t-tests as there are X variables.

P-value for b1 = .006

P-value for b2 = .439

P-value for b3 = .07

Solve it and compare to the ANSWER

R-sqrd is the amount of variance in Y explained by the set of X variables. It is expressed as a percentage and thus takes values from 0 to 100% (or 0 to 1 when expressed in decimal form). Adjusted R-sqrd is "adjusted" for the number of X variables (k in the formula) and the sample size (n in the formula). Both R-sqrd and adjusted R-sqrd are easily calculated. R-sqrd is SSR/SST, and these can be pulled right out of the ANOVA table in the MR. The adjusted R-sqrd formula is shown on page 484 of the text. Again, both of these can be calculated from the ANOVA table and are always provided as part of the computer output.

Solve it and compare to the ANSWER

**Construct table** If Total df = 24 and Error df = 21, then Regression df must = 24 - 21 = 3 because total = error + regression. Also note that if total df = 24 then the sample size used to construct this MR must be 25 (total = n - 1). Next step: if SSE = 903 and error df = 21, then MSE must equal SSE/error df = 903/21 = 43. If SSR = 345 and regression df = 3, then MSR = 345/3 = 115, and the F-ratio = MSR/MSE = 115/43 = 2.67.

Our next task is to test the "significance" of this model based on that F-ratio using the standard five step hypothesis testing procedure.

**Hypotheses:** H0: all coefficients are zero (B1 = B2 = B3 = 0); Ha: at least one coefficient is not zero

**Critical value:** an F-value based on k numerator df and n - (k + 1) denominator df; with k = 3 and n = 25 this gives us F(3, 21) at .05 = 3.07

**Calculated Value:** From above the F-ratio is 2.67

**Compare:** F-calc < F-crit (2.67 < 3.07) and thus do not reject H0.

**Conclusion:** We cannot conclude that this model has any explanatory power with respect to Y. In other words, the set of X variables in this model does not significantly help us explain or predict the Y variable. This model is NOT SIGNIFICANT. We would not use this model (in its current form) to make specific predictions of Y. The sample gives no evidence of a regression relationship between the Y variable and the X variables.
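The arithmetic in this answer can be checked with a short Python sketch. It only rebuilds the ANOVA quantities from the four numbers given in the problem; the function name `anova_table` is made up for illustration, and the critical value 3.07 is simply copied from the step above rather than computed.

```python
def anova_table(total_df, error_df, ssr, sse):
    """Rebuild the ANOVA quantities for a multiple regression."""
    regression_df = total_df - error_df   # total = error + regression
    n = total_df + 1                      # total df = n - 1
    msr = ssr / regression_df             # mean square regression
    mse = sse / error_df                  # mean square error
    return {
        "regression_df": regression_df,
        "n": n,
        "msr": msr,
        "mse": mse,
        "f_ratio": msr / mse,
    }

# Values from the problem: error df = 21, total df = 24, SSR = 345, SSE = 903
table = anova_table(total_df=24, error_df=21, ssr=345, sse=903)

F_CRIT = 3.07  # F(3, 21) at alpha = .05, taken from the answer above
reject_h0 = table["f_ratio"] > F_CRIT

print(table)
print("Reject H0:", reject_h0)
```

Because the F-ratio falls below the critical value, `reject_h0` comes out `False`, which matches the "do not reject H0" decision.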


**Consider each p-value** By our standard, if the p-value is less than .05 (our standard alpha) then we REJECT Ho. Thus for B1 we would reject (p < alpha), and for B2 and B3 we would fail to reject (p > alpha).

What NULL are we considering?

**Hypotheses:** For each coefficient we are testing H0: Bi = 0 (this variable is unrelated to the dependent variable) at alpha = .05.

**Conclusion:** Variable X1 is significant and contributes to the model's explanatory power, while X2 and X3 do not contribute significantly to the model's explanatory power. We conclude that B1 does not equal 0, while we cannot conclude that B2 and B3 differ from 0. These results suggest dropping variables X2 and X3 from the model and re-running the regression to test this new model.

**NOTE:** If instead of the p-values you were given the actual values of the b's and the SEb's, then you would be able to solve this by manually calculating the t-value (one for each X variable) and comparing it with your t-critical value (it's the same for each t-test within a single model) to determine whether to reject or fail to reject the Ho associated with each X.
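The p-value decision rule used in this answer can be written as a few lines of Python. This is just a sketch of the comparison against alpha, using the three p-values given in the problem; the dictionary names are illustrative.

```python
ALPHA = 0.05  # our standard alpha

# p-values given in the problem, one per coefficient
p_values = {"b1": 0.006, "b2": 0.439, "b3": 0.07}

# Reject H0: Bi = 0 only when the p-value is below alpha
decisions = {
    name: "reject H0" if p < ALPHA else "do not reject H0"
    for name, p in p_values.items()
}

for name, decision in decisions.items():
    print(f"{name}: p = {p_values[name]} -> {decision}")
```

Only b1 falls below alpha, so only X1 is judged to contribute to the model's explanatory power.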


**Calculate R-sqrd:** R-sqrd = SSR/SST, and SST = SSR + SSE = 45 + 55 = 100. Thus SSR/SST = 45/100 = .45 or 45%. Thus according to the sample this regression model explains 45% of the variance in the Y variable.

**Calculate adjusted R-sqrd:** 1 - (1 - .45)((n - 1)/(n - (k + 1))) = 1 - .55(29/25) = 1 - .55(1.16) = 1 - .638 = .362, so 36.2% of the variance in Y can be explained by this regression model in the population. Notice that adjusted R-sqrd is lower than R-sqrd. Whether these values of R-sqrd are good or bad depends on your own interpretation, but in this case 45% would probably be considered not very good, and other models would be examined.
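Both calculations can be reproduced with the sketch below. Note one assumption: the values n = 30 and k = 4 are not stated in the problem; they are inferred from the 29/25 ratio in the answer above (n - 1 = 29 and n - (k + 1) = 25). The function names are made up for illustration.

```python
def r_squared(ssr, sse):
    """R-sqrd = SSR / SST, where SST = SSR + SSE."""
    return ssr / (ssr + sse)

def adjusted_r_squared(r2, n, k):
    """Adjust R-sqrd for sample size n and number of X variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - (k + 1))

r2 = r_squared(ssr=45, sse=55)

# ASSUMPTION: n = 30 and k = 4 are inferred from the 29/25 ratio in the
# worked answer; they are not given explicitly in the problem.
adj_r2 = adjusted_r_squared(r2, n=30, k=4)

print(f"R-sqrd = {r2:.3f}, adjusted R-sqrd = {adj_r2:.3f}")
```

Adding X variables never lowers R-sqrd, but it can lower adjusted R-sqrd, which is why the adjusted value is the one to compare across models with different numbers of X variables.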


**1. How many dummy variables are needed?** In a multiple regression there are times we want to include a categorical variable in our model. Examples might include gender or education level. Unfortunately we cannot just enter them directly because they are not continuously measured variables. However, they can be represented by dummy variables. The answer to "how many?" is easy: it is r-1, where r = the number of categories in the categorical variable. Thus for gender (male - female) we would need only one dummy variable, with a coding scheme of Xi = 1 when the individual is male and 0 when female. Female becomes the base case, and the bi associated with Xi becomes the amount of change in Y when the individual is male versus female.

For the education level example, if we have a question on "highest level completed" with categories (1) grammar school, (2) high school, (3) undergrad, (4) graduate, we have 4 categories and need 3 dummy variables (4-1). Thus we create 3 X variables and insert them in our regression equation. We decide on our base case - in this example it will be grammar school. This category will not have an X variable but instead will be represented by the other 3 dummy variables all being equal to zero. We can make X1 = 1 for high school, X2 = 1 for undergrad and X3 = 1 for graduate. For each of these we are comparing the category in question to the grammar school category (our base case). The best way to lay this out is to build a little table to organize that coding, see below:

| category/variable | X1 | X2 | X3 |
|---|---|---|---|
| Grammar School | 0 | 0 | 0 |
| High School | 1 | 0 | 0 |
| Undergrad | 0 | 1 | 0 |
| Graduate | 0 | 0 | 1 |

Thus no matter how many other variables are in the model, in order to include education level in your model you will have to add 3 new dummy variables (X's) to the model.
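The coding table above can also be generated programmatically. The sketch below is one simple way to do it in plain Python, assuming the first category in the list is the base case; the names `CATEGORIES` and `dummy_code` are invented for this illustration.

```python
# Categories in order; the FIRST one is treated as the base case
CATEGORIES = ["grammar school", "high school", "undergrad", "graduate"]

def dummy_code(category, categories=CATEGORIES):
    """Return the r-1 dummy values (X1, X2, ...) for one category."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    non_base = categories[1:]  # the base case gets no dummy variable
    return [1 if category == c else 0 for c in non_base]

for cat in CATEGORIES:
    print(f"{cat:15s} {dummy_code(cat)}")
```

With 4 categories the function always returns 3 values, and the base case is the row where all of them are zero.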

**2. How to Interpret Dummy Variables.**

When an MR equation is calculated by the computer you will get a b value associated with each X variable, whether it is a dummy variable or not. The significance of the model and of each individual coefficient is tested the same as before. Concluding that a dummy variable is significant (rejecting the null and concluding that this variable does contribute to the model's explanatory power) means that knowing what category a person falls in helps us explain more variance in Y. So for instance, in the example above with education level, if we test the B associated with X1 and determine it to be "significant", then that tells us that X1 (high school vs. grammar school) does contribute to the model's explanatory power. Thus knowing whether a person has a high school education (versus only a grammar school education) helps us explain more of whatever the Y variable is. This process is repeated for each dummy variable, just as it is for each X variable in general.

**3. How to Use Dummy Variables in Prediction.**

Take the following model....

Y = 1000 + 25X1 + 10X2 - 30X3 + 15X4 where;

Y = annual sales dollars generated by an auto parts counter person

X4 = years of experience

X1, X2, & X3 are the dummy variables representing the education level for the counter person, as coded in the table in section 1 above.

**SOME QUESTIONS?**

(1) If a salesperson has a graduate degree, how much will sales change according to this model compared to a person with a grammar school education?

(2) How much in sales will a counter person with 10 years of experience and a high school education generate?

(3) Why did we need three dummy variables to use "education level" in this regression equation?

**ANSWERS:**

**(1)** We need to isolate which of the dummy variables represents a person with a graduate degree; the coefficient associated with that variable then represents how much more or less a person with a graduate degree will generate in sales versus a person with a grammar school education. In this case we are asking which variable is coded 1 for a graduate degree, and from the table in part 1 we see that is X3. The b associated with X3 = -30 from the model above, and thus a person with a graduate degree will generate $30 less than a person with only a grammar school education, holding years of experience constant.

**(2)** Plug in the correct values for X1, X2, X3 & X4 and solve. X4 is easy: it is the experience level and is not a dummy variable, so X4 = 10 in this case. X1 is going to = 1 because the person's highest level completed is high school. X2 = 0 and X3 = 0 because, when a person is in the high school category, that is the value of those two variables according to the table in part 1. This means that those two variables will drop out of the equation for this prediction, because no matter what their b value is it will get multiplied by 0 and thus will = 0. Thus the equation will look like this...

**Y = 1000 + 25(1) + 10(0) - 30(0) + 15(10) = 1000 + 25 + 150 = 1175**

This equation illustrates that no more than one of the dummy variables in the equation will end up staying in the equation for any given prediction. At least 2 of the dummy variables in this case had to equal zero because there were three total dummy variables. If we changed the question and said the person's highest level of education was grammar school, all three dummy variables (X1, X2 & X3) would have been equal to zero and the model would have consisted only of Y = 1000 + 15(10) = 1150, which represents the sales generated by a clerk with 10 years of experience and only a grammar school education - the base case.

**(3)** We needed three dummy variables to represent the "education level" of the individual because there were 4 categories of education level (thus r = 4) and **we always need r-1 dummy variables**.
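The predictions in answers (1) and (2) can be checked with a small Python sketch of the model Y = 1000 + 25X1 + 10X2 - 30X3 + 15X4. The function name `predict_sales` is invented for this illustration; the coefficients come straight from the model above.

```python
def predict_sales(x1, x2, x3, years):
    """Predicted annual sales from Y = 1000 + 25X1 + 10X2 - 30X3 + 15X4."""
    return 1000 + 25 * x1 + 10 * x2 - 30 * x3 + 15 * years

# Question (2): high school (X1 = 1, X2 = 0, X3 = 0), 10 years of experience
high_school_10 = predict_sales(1, 0, 0, years=10)

# Base case: grammar school (all dummies zero), 10 years of experience
base_10 = predict_sales(0, 0, 0, years=10)

# Question (1): graduate degree (X3 = 1) versus the base case, same experience
grad_vs_base = predict_sales(0, 0, 1, years=10) - base_10

print(high_school_10, base_10, grad_vs_base)
```

The difference between any education category and the base case is just that category's coefficient, no matter what value of experience is plugged in.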


revised; 8-11-09