REGRESSION ANALYSIS:

(Note: The following comments and examples of regression analysis are meant to complement the readings on regression analysis: SPSS Applications Guide, Chapter 12 and Munro, Chapter 12 up to p.274. They are not meant as a complete guide to regression analysis.)

(A) Regression and Analysis of Variance:

Like analysis of variance and the t-test, regression analysis is part of the group of statistical models known as the General Linear Model (GLM). They all have in common that they start with a measure of the total variation of the scores on the outcome/dependent variable. This measure is known as the total sum of squares (TSS), defined as: TSS = Sum(Yi - Y(bar))^2. In other words, we start with the sum of all squared deviations around the overall (sample or population) mean. Incidentally, since the GLM starts with the squared deviations from the mean of the outcome variable, it must be possible to compute a mean, and the distances/deviations from the mean must be defined. This requires that the outcome variable is measured at the interval or ratio level. (Together, these two kinds of variables are often called 'continuous' variables.)

The differences between analysis of variance models and regression models primarily have to do with the properties of the independent or predictor variables. In analysis of variance models, the independent variables tend to be nominal-level (or categorical) variables with arbitrary values assigned to the categories. (Note: independent variables in analysis of variance models are often called 'factors'.) Independent variables in regression analysis are often themselves continuous (interval- or ratio-level) variables. Both regression and analysis of variance models can handle all types of independent variables; however, it is generally easier to use regression when most of the independent variables are continuous (since categorical variables must be represented through the often cumbersome method of 'dummy-coding'), and to use analysis of variance when most of the independent variables are categorical factors. (In analysis of variance models, continuous independent variables are treated as 'covariates', in which case the models are also called 'analysis of covariance'.)

One last piece of terminology should be mentioned here. We started by saying that all GLM models begin with the TSS. The fundamental question then becomes: how much of the variation in the particular outcome measure can be associated with, or attributed to, any and all of the independent or predictor variables? This question is answered by decomposing the TSS into two fundamental components: the 'explained' sum of squares and the 'unexplained' or 'error' sum of squares. The 'explained' sum of squares is the amount of variation in the dependent variable associated with all the independent variables or factors. The 'unexplained' sum of squares is the amount of variation in the dependent variable that is completely independent of the independent variables or factors. The unexplained or error sum of squares may represent measurement error in the outcome variable, or it may represent systematic variation that is NOT associated with the predictor variables that are part of the model. It is always true, by definition, that TSS = explained SS + unexplained SS. It is important to note here that analysis of variance and regression use slightly DIFFERENT terminology for the SAME sums of squares. In analysis of variance, the unexplained sum of squares is called the 'within-group sum of squares' or WGSS, while the same measure is called the 'residual sum of squares' or 'error sum of squares' (ESS) in regression analysis. Likewise, in analysis of variance, the systematic variation associated with the independent variable(s) is usually called the 'between-group sum of squares' or BGSS; this same sum of squares is called the 'regression sum of squares' or RSS in regression analysis. Thus, in analysis of variance we get: TSS = BGSS + WGSS, and in regression analysis: TSS = RSS + ESS.
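
To make the decomposition concrete, here is a minimal sketch in Python (using made-up numbers, not the cancer-patient data) that fits a least-squares line and verifies that the regression and residual sums of squares add up to the total sum of squares:

    import numpy as np

    # Hypothetical scores (any small data set will do)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor
    y = np.array([4.0, 7.0, 11.0, 15.0, 18.0])   # outcome

    slope, intercept = np.polyfit(x, y, 1)       # ordinary least-squares fit
    y_hat = intercept + slope * x                # predicted scores

    tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
    rss = np.sum((y_hat - y.mean()) ** 2)        # regression ('explained') sum of squares
    ess = np.sum((y - y_hat) ** 2)               # residual ('error') sum of squares

    print(round(tss, 4), round(rss + ess, 4))    # identical: TSS = RSS + ESS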



(B) Simple Regression:

So much for the preliminary terminology. Now, let's focus on regression analysis. Our first example introduced in class was a SIMPLE (linear) regression. In a simple regression, there is only one independent variable, and since this is a regression model, BOTH the dependent and the independent variable are continuous. Here, we focus on patient depression as the outcome/dependent variable and the count of symptoms reported by the cancer patients as the independent variable. It goes without saying that our implied research hypothesis is: as the number of reported symptoms (like fatigue, nausea, vomiting, bleeding, swelling, etc.) increases, we expect the depression score to increase as well.

The tables below show the output from the SPSS regression run.

Descriptive Statistics

                                          Mean   Std. Deviation     N
PCESD depression score (patient)         11.02             7.73   783
PSYMCNT count of all reported symptoms    7.96             4.42   783


This table shows the descriptive information about both the dependent and independent variables.

It tells us that the mean CESD depression score in this sample of 783 cancer patients is 11.02, with considerable variation in scores as indicated by the standard deviation of 7.73. (Incidentally, this table does not show that the depression scores range from a minimum of 0 to a maximum of 42 among these 783 cases.) The number of reported symptoms averages almost 8 for the sample of 783 cases (the counts range from 0 to 27, also not shown in this table) and also displays considerable variation: the standard deviation equals 4.42.

Correlation

                                              PSYMCNT count of all reported symptoms
PCESD depression score  Pearson Correlation                                     .494
                        Sig. (1-tailed)                                         .000
                        N                                                        783

The correlation table shows that, as expected, symptom count and depression scores correlate positively (higher numbers of symptoms are associated with higher depression scores) and fairly strongly. The correlation is .494. It is highly significant (p-value < .0005), which means that we reject the null hypothesis that the correlation is zero in the population, i.e., that the observed sample correlation is merely a sampling fluke.
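
As a sketch of the computation behind this table, the Pearson correlation and its p-value can be obtained with scipy (the numbers below are made up; the actual patient data are not reproduced here):

    import numpy as np
    from scipy import stats

    symptoms = np.array([2, 5, 1, 9, 12, 7, 4, 15, 3, 8], dtype=float)        # hypothetical symptom counts
    depression = np.array([5, 9, 3, 14, 20, 10, 8, 25, 6, 12], dtype=float)   # hypothetical CESD scores

    r, p_two_tailed = stats.pearsonr(symptoms, depression)
    p_one_tailed = p_two_tailed / 2   # SPSS reports Sig. (1-tailed) when the direction is predicted in advance

    print(round(r, 3), p_one_tailed)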



Model Summary
Model      R   R Square   Adjusted R Square   Std. Error of the Estimate
1       .494       .244                .243                       6.7245

a Predictors: (Constant), PSYMCNT count of all reported symptoms

The model summary provides useful information about the regression analysis. First, the 'Model' column, together with the footnote below it, tells us that a single model with only one independent variable was estimated. The next column reports the 'multiple R'. This is the correlation between the actually observed dependent variable and the dependent variable as predicted by the regression equation. In a simple regression with only one independent variable, this is the same as the simple Pearson correlation between the dependent and independent variable. 'R square' is the square of R and is also known as the 'coefficient of determination'. It tells us what proportion (or percentage) of the (sample) variation in the dependent variable can be attributed to the independent variable(s). In our example, we can say that 24.4% of the variation in depression scores among these cancer patients appears to be accounted for by the variation in their reported number of symptoms. (The 'adjusted R square' is the best estimate of R square for the population from which the sample was drawn.) Finally, the 'standard error of the estimate' tells us that, on average, observed depression scores deviate from the predicted regression line by about 6.7 points. This is not surprising: since our regression model explains 24.4% of the variation, it can NOT account for the other 75.6%, which most likely represents both measurement error in depression and other factors influencing depression that we have not considered.
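
The statistics in the Model Summary can be recomputed directly from the sums of squares reported in the ANOVA table that follows. A minimal sketch, using the values from this output:

    # Sums of squares from the ANOVA table below
    rss, ess, tss = 11412.06, 35315.61, 46727.67   # regression, residual, total SS
    n, k = 783, 1                                  # sample size and number of predictors

    r_square = rss / tss                                         # .244
    adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)    # .243
    std_error_of_estimate = (ess / (n - k - 1)) ** 0.5           # 6.7245

    print(round(r_square, 3), round(adj_r_square, 3), round(std_error_of_estimate, 4))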

ANOVA

Model              Sum of Squares    df   Mean Square        F   Sig.
1   Regression           11412.06     1      11412.06   252.38   .000
    Residual             35315.61   781         45.22
    Total                46727.67   782

a Predictors: (Constant), PSYMCNT count of all reported symptoms

b Dependent Variable: PCESD depression score (patient)

This table presents the results of the analysis of variance associated with the regression analysis. In the 'Sum of Squares' column you find the decomposition of the total sum of squares into the regression sum of squares (= 'explained') and the residual sum of squares (= 'error'). If you divide the regression SS by the TSS, or 11,412.06/46,727.67 = .244, you get 'R square', the proportion of variation in the dependent variable accounted for by the independent variable. The 'Mean Square' column shows the average variation associated with the regression and with the residuals and is computed by dividing the SSs by their respective degrees of freedom. (Note: A full explanation of the concept of degrees of freedom is beyond this course. It can be shown that, if one already knows the equation for the regression line and one knows N-2 individual observed sample values for the CESD score, one can reconstruct the final 2 sample values from this knowledge. Thus only N-2 sample values (= residual df) are truly free to vary in the sampling process. The mathematical proof requires calculus and is a bit involved.) Important for interpreting this output is the next column, containing the F-statistic. The F-statistic is a ratio of two numbers: the mean square (or average variation) associated with the regression and the mean square (or average variation) associated with the residuals or errors. In other words, the F-statistic represents a ratio of explained variance to unexplained variance. It is clear that if the independent variable(s) is/are not at all associated with, or do not predict, variation in the dependent variable, then the regression sum of squares equals 0 and F equals 0 also. The larger the regression sum of squares in relation to the residual sum of squares, the more of the variation in the dependent variable is explained by the independent variable(s). Of course, if RSS grows relative to ESS, so does the F-value. This is the basis for an important statistical test. The null hypothesis underlying the F-test is: all independent variables combined have NO effect on the dependent variable, and thus F = 0 in the population from which the sample was drawn. However, because of sampling fluctuation, we would never expect to observe an F-value of exactly zero. So our usual question becomes: is the observed sample F-value so large that it is unlikely that mere random sampling fluctuation could have produced it? Our observed F-value of 252.38 has a p-value of .000 associated with it. Thus fewer than 1 in a thousand samples would randomly produce such a large F-value if the samples come from a population in which the true F-value is 0. Thus, we reject the null hypothesis that the independent variable(s) don't explain any variation in the dependent/outcome variable. Conclusion: in the population of cancer patients from which this sample was drawn, depression is associated with symptom counts. (Note: since in a simple regression there is only one independent variable, a significant F-test already tells us that the (only) independent variable has a significant effect.)
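
A small sketch of the F-test computation, using the sums of squares and degrees of freedom from the table above (scipy supplies the upper-tail probability of the F distribution):

    from scipy import stats

    rss, ess = 11412.06, 35315.61    # regression and residual sums of squares
    df_reg, df_res = 1, 781          # their degrees of freedom

    ms_reg = rss / df_reg            # mean square regression = 11412.06
    ms_res = ess / df_res            # mean square residual  = 45.22
    F = ms_reg / ms_res              # about 252.4

    p_value = stats.f.sf(F, df_reg, df_res)   # probability of an F this large under the null hypothesis
    print(round(F, 2), p_value)               # p is far below .0005, so the null hypothesis is rejected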

Coefficients

                                                 (Constant)   PSYMCNT count of all reported symptoms
Unstandardized Coefficients    B                       4.14                                     .865
                               Std. Error              .496                                     .054
Standardized Coefficients      Beta                                                             .494
t                                                      8.35                                   15.886
Sig.                                                   .000                                     .000
95% Confidence Interval for B  Lower Bound             3.16                                     .758
                               Upper Bound             5.12                                     .971
a Dependent Variable: PCESD depression score (patient)

This table gives the actual estimates for the regression equation. The row of 'unstandardized coefficients' or 'Bs' gives us the necessary coefficient values for the simple regression model. The 'constant' of 4.14 represents the intercept in the equation and the coefficient in the column labeled by the independent variable (X = symptom count) represents the slope coefficient. At the bottom of the table, we are told that the dependent variable (= Y) is the depression score. Thus, this regression equation is:

Y(hat) = 4.14 + .865X, where Y(hat) is the predicted value of Y (the predicted depression score) and X is the symptom count (the predictor variable).

(Comment: How was this regression equation arrived at? The values for the intercept and the slope coefficient were chosen in such a way that the sum of squared deviations of the observed scores from the regression line is minimized. In this sense, the regression line is the 'line of best fit'. Through the application of calculus, it is possible to compute the exact values of the intercept and the slope based solely on sample information about the Xs and Ys. Any advanced textbook on regression analysis contains the derivation of these so-called 'normal equations'. Since the method leads to minimum squared deviations of observed from predicted values, it has come to be known as the method of 'ordinary least squares' estimation.)
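
As a sketch of what the normal equations reduce to in the simple-regression case, the slope equals the covariance of X and Y divided by the variance of X, and the intercept follows from the two means (the data below are made up for illustration):

    import numpy as np

    x = np.array([0, 3, 5, 8, 10, 14, 20], dtype=float)    # hypothetical symptom counts
    y = np.array([4, 6, 9, 10, 13, 16, 22], dtype=float)   # hypothetical depression scores

    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()

    # A library least-squares routine gives the same answer:
    check_slope, check_intercept = np.polyfit(x, y, 1)
    print(round(slope, 3), round(intercept, 3))
    print(round(check_slope, 3), round(check_intercept, 3))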

Let's go back to the regression equation.

The equation Y(hat) = 4.14 + .865X is a sample estimate of the true population equation. Thus, before we can interpret it, we need to answer our usual question: are the observed sample values indicative of real effects in the population? As usual, we start with a null hypothesis of 'no effect'. In the case of a simple regression, the null hypothesis would be: symptoms do not influence depression in cancer patients. This verbal null hypothesis translates into a population regression equation in which the slope of the regression line equals zero. Why? If the slope coefficient (associated with X) equals zero, then any change in X has no effect on Y, the dependent depression score. As we know by now, even though the slope coefficient may be exactly equal to zero in the population, repeated sampling from this population will likely result in sample estimates of the slope coefficient that differ from one sample to the next. Thus our 'eternal question' is again: is the sample estimate of the slope coefficient so large that it is unlikely that mere sampling fluctuation could have produced it? As always, we decide this question by comparing the observed sample estimate to the size of its standard error (which indicates by how much such sample estimates vary, on average, from one sample to the next). In our case, the standard error associated with the slope coefficient is .054, which is extremely small in relation to the sample estimate of .865. In fact, if you divide .865 by .054, you get the t-value of 15.886. It tells you that the observed sample slope coefficient is almost 16 standard errors larger than zero. Since the sampling distribution of the slope coefficient follows the t-distribution, we only need to ask how likely it is that mere random sampling produces a slope coefficient that differs from zero by almost 16 standard errors. The probability of that happening by chance is practically non-existent. As the associated p-value of .000 tells us, we reject the null hypothesis that this sample was drawn from a population in which the symptom count does not affect the depression score. The same logic applies to the intercept coefficient. It too differs significantly from zero, since it is more than 8 standard errors from zero. (Note: since regression coefficients follow the t-distribution, researchers have adopted a simple rule of thumb: any coefficient that is larger than twice its standard error is 'statistically significant' at the .05 level.) At the bottom of the 'coefficients' table, you also see the 95% confidence intervals. What do they tell us? We are 95% confident that the true population intercept lies somewhere between 3.16 and 5.12, and that the true population slope coefficient lies somewhere between .758 and .971.
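
A sketch of these calculations, using the slope and standard error reported in the coefficients table (small differences from the printed t-value and confidence bounds are due to rounding of B and its standard error in the output):

    from scipy import stats

    b, se_b = .865, .054      # slope estimate and its standard error (from the table)
    df_res = 781              # residual degrees of freedom (N - 2)

    t = b / se_b                                 # roughly 16 standard errors from zero
    p_value = 2 * stats.t.sf(abs(t), df_res)     # two-tailed p-value
    t_crit = stats.t.ppf(.975, df_res)           # about 1.96 in a sample this large
    ci_lower = b - t_crit * se_b                 # about .759
    ci_upper = b + t_crit * se_b                 # about .971

    print(round(t, 2), p_value, (round(ci_lower, 3), round(ci_upper, 3)))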

Finally, we are ready to interpret the regression equation. Again, our best estimate of the population regression equation is:

Y(hat) = 4.14 + .865X, with both estimated coefficients differing significantly from zero.

In our example, the observed range of the independent variable actually includes zero. Thus, there are cancer patients who do not report any of these symptoms. The regression equation tells us that we expect such cancer patients to have, on average, a depression score of 4.14 (equal to the intercept, since X = 0). Now, let's look at a cancer patient who reports 10 symptoms. We predict that such a patient will typically have a depression score of 12.79 (= 4.14 + .865 x 10). Thus, we see that the slope coefficient provides us with the most important information: it shows us by how much the dependent (depression) score changes for a change in the independent (symptom) score of one unit.
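
A tiny sketch of this prediction logic (the function name is just for illustration):

    def predict_depression(symptom_count, intercept=4.14, slope=.865):
        """Predicted CESD score from the estimated equation Y(hat) = 4.14 + .865X."""
        return intercept + slope * symptom_count

    print(predict_depression(0))    # 4.14  -> a patient reporting no symptoms
    print(predict_depression(10))   # 12.79 -> a patient reporting 10 symptoms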

(C) Multiple Regression:

The following tables present results from a multiple regression analysis. It again has only one dependent variable, but contains more than one independent or predictor variable. The dependent or outcome variable is again the CES-D depression scale. This time, however, we have added patient sex (1=female, 0=male), patient age (in years, ranging from 64 to 98), the number of comorbid conditions (a count of chronic diseases, such as arthritis, diabetes, etc., ranging from 0 to a maximum of 9), the physical functioning subscale of the SF-36 (ranging from 0 = complete immobility to 100 = 'perfect' functioning), and the patient symptom count (0-27). The first table again shows the descriptive sample statistics for all these variables.

Descriptive Statistics

                                       Mean   Std. Deviation     N
PCESD depression score (patient)      10.84             7.65   746
PSEX2 patient sex (recoded)             .47              .50   746
PAGE patient age (in years)           72.20             4.99   746
PCOMORBI Patient Comorbity Count       2.71             1.68   746
MOSPF: Pt. Phys.Functioning           63.98            28.16   746
PSYMCNT count of reported symptoms     7.87             4.42   746


Notice that, except for patient sex, all of the independent variables are continuous variables with meaningful, interpretable means and standard deviations. (E.g., the mean physical functioning score in the sample is 63.98 with the average deviation around the mean being 28.16, etc.) The only variable that is NOT an interval level variable is sex. Regression can accommodate such nominal-level categorical variables, if they have only two categories and are 'dummy-coded', that is to say, one category takes on the value '1', the other the value '0'. (Which one is coded one or zero is arbitrary.) In this special case, the mean of .47 simply indicates that 47% of the sample are female (since we coded 1 = female). As we will see below, regression coefficients of such dummy variables also have simple interpretations.
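
A minimal sketch of dummy-coding (with made-up raw codes), showing why the mean of a 0/1 dummy equals the proportion in the category coded 1:

    import numpy as np

    sex = np.array(['female', 'male', 'female', 'female', 'male'])   # hypothetical raw values
    sex_dummy = (sex == 'female').astype(int)                        # 1 = female, 0 = male (the choice is arbitrary)

    print(sex_dummy)           # [1 0 1 1 0]
    print(sex_dummy.mean())    # 0.6 -> 60% of these hypothetical cases are female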

Model Summary
Model      R   R Square   Adjusted R Square   Std. Error of the Estimate
5       .520       .271                .266                       6.5565

a Predictors: (Constant), PSYMCNT count of all reported symptoms, PSEX2 patient sex (recoded), PAGE patient age (in years), PCOMORBI Patient Physical Comorbity (Count) , MOSPFCU SF-36: Pt.Physical Functioning-(CU)

This 'Model Summary' table looks exactly like the one for the simple regression, and the statistics in it have exactly the same interpretation. This time, we have a model with five independent or predictor variables of the dependent variable, which is depression. The multiple 'R' again indicates the size of the correlation between the observed outcome variable and the predicted outcome variable (based on the regression equation). R square, or the coefficient of determination, again indicates the proportion of variation in the dependent scores attributable to ALL independent variables combined, with the 'adjusted R square' again giving an estimate of the population value of R square. Finally, the 'standard error of the estimate' gives us an indication of the average spread of observed depression scores around the predicted regression line. When you compare these results to the same table from the simple regression above, you will see great similarities, except that R square is now a bit larger (.271 instead of .244) and the standard error of the estimate is slightly smaller (6.5565 instead of 6.7245). What this means is that the additional independent variables allow us to predict cancer patient depression a little bit better (we can now account for 27.1% of the variation in depression scores instead of 24.4%). As a result, the average 'error' or unexplained variation around the regression line is a bit smaller.
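
As a quick check of the 'adjusted R square' reported here, it follows from R square, the sample size, and the number of predictors:

    n, k = 746, 5        # sample size and number of predictors (from the output)
    r_square = .271

    adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
    print(round(adj_r_square, 3))   # .266, as shown in the Model Summary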

ANOVA
Model              Sum of Squares    df   Mean Square        F   Sig.
1   Regression          11803.415     5      2360.683   54.915   .000
    Residual            31811.282   740        42.988
    Total               43614.697   745

a Predictors: (Constant), PSYMCNT count of all reported symptoms, PSEX2 patient sex (recoded), PAGE patient age (in years), PCOMORBI Patient Physical Comorbity (Count) , MOSPFCU SF-36: Pt.Physical Functioning-(CU)

b Dependent Variable: PCESD depression score (patient)

The ANOVA table again has the same interpretation as the one for a simple regression. It decomposes the total sum of squares into the regression (= explained) SS and the residual (= unexplained) SS. The ratio of the regression SS over the total SS, or 11,803.415/43,614.697 = .271, is, of course, identical to R square. Finally, the F-test is again the ratio of the mean regression SS (the average variation of the regression predictions around the sample mean) to the mean residual SS (the average squared deviation of observed scores from the regression line). Thus it represents the relative magnitude of explained to unexplained variation. The F-statistic is highly significant (p = .000), so we reject the null hypothesis that none of the independent variables predicts the depression scores in the population.

Coefficients

Model 1                                   (Constant)   PSEX2 sex   PAGE age   PCOMORBI comorbidity   MOSPF phys. funct.   PSYMCNT symptoms
Unstandardized Coefficients  B                 8.978       1.514     -.02411                 .07200              -.04414               .701
                             Std. Error        3.666        .493        .049                   .154                 .010               .062
Standardized Coefficients    Beta                           .099       -.016                   .016                -.162               .405
t                                              2.449       3.069       -.489                   .467               -4.474             11.271
Sig.                                            .015        .002        .625                   .640                 .000               .000
95% Confidence Interval      Lower Bound       1.781        .545       -.121                  -.230                -.064               .579
for B                        Upper Bound      16.175       2.483        .073                   .374                -.025               .824

a Dependent Variable: PCESD depression score (patient)

Again, we use the 'coefficients' table to construct the regression equation. It is:

Y(hat) = 8.978 + 1.514 X1 - .024 X2 + .072 X3 - .044 X4 + .701 X5,

where X1 = Patient Sex, X2 = Patient Age, X3 = Count of Patient Comorbidities, X4 = Patient Physical Functioning Score, and X5 = Symptom Count.
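
For readers who want to reproduce a table like this outside SPSS, here is a minimal sketch using Python's statsmodels package. The data frame below is randomly generated as a stand-in, since the actual patient file is not reproduced here; with the real data, the printed summary would mirror the SPSS coefficients table (B, standard errors, t, p, and 95% confidence intervals).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        'PSEX2':    rng.integers(0, 2, n),       # dummy-coded sex (1 = female, 0 = male)
        'PAGE':     rng.integers(65, 99, n),     # age in years
        'PCOMORBI': rng.integers(0, 10, n),      # comorbidity count
        'MOSPF':    rng.uniform(0, 100, n),      # physical functioning (0-100)
        'PSYMCNT':  rng.integers(0, 28, n),      # symptom count
    })
    # A made-up outcome loosely following the pattern described in the text
    df['PCESD'] = (4 + 1.5 * df['PSEX2'] - .04 * df['MOSPF'] + .7 * df['PSYMCNT']
                   + rng.normal(0, 6, n))

    X = sm.add_constant(df[['PSEX2', 'PAGE', 'PCOMORBI', 'MOSPF', 'PSYMCNT']])
    model = sm.OLS(df['PCESD'], X).fit()
    print(model.summary())      # coefficients, standard errors, t, p, and 95% CIs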

Do any of these regression or slope coefficients differ significantly from zero? We answer this question by looking at the magnitude of the coefficients in relation to their standard errors. The row of t-values gives us the ratio of the regression coefficients to their standard errors; for instance, the t-value for patient sex is 3.069, which equals 1.514/.493. What does this tell us? Right below the t-values, you see the p-values or significance values associated with the t-values. This gives us all the pieces we need to draw statistical inferences about the population. We start with a null hypothesis for each independent variable, namely, that it has no effect on the outcome variable (here: depression). This is the same as saying that the particular regression coefficient we are focusing on is assumed to be zero in the population from which the sample is drawn. Now we proceed with our familiar reasoning: suppose patient sex has no effect on depression. In that case, the true regression coefficient associated with the sex variable ought to be zero in the population. In our sample, however, we observe a sex regression coefficient of 1.514. How likely is that to happen as a result of mere sampling chance? The answer is that a coefficient of 1.514 is more than 3 standard errors larger than zero, and that occurs by chance in only about 2 out of 1,000 samples drawn from such a population. Thus we reject the null hypothesis (conventionally, as long as p < .05): we are quite confident that sex does have a real effect on depression. The same logic applies to all the other regression coefficients. As you can see, two of them, the coefficients for patient age and for the number of comorbid conditions, do NOT differ significantly from zero. In fact, there is a more than 60% likelihood that each of these observed sample coefficients is the result of sampling chance. Thus, we do not take them to be 'real' and conclude that the population regression coefficients for these variables equal zero.

This simplifies our regression equation to:

Y(hat) = 8.978 + 1.514 X1 - .044 X4 + .701 X5, since only X1, X4, and X5 are significant predictors of depression.

If we now substitute particular values for these variables, we can compute the predicted depression score. For instance, a female cancer patient with moderate physical functioning (score of 60) and 5 reported cancer symptoms will, on average, have a depression score of 11.357 (= 8.978 + 1.514 x 1 - .044 x 60 + .701 x 5).

The 'coefficients' table also contains the standardized regression coefficients. These are useful mainly in multiple regressions with at least two independent variables. The problem with the unstandardized coefficients is that they are measured in different units (an increment of one unit means, for instance, the difference between female and male on the sex variable, one additional symptom on the symptom count, or one point on the physical functioning scale). With these different units of measurement, we cannot directly answer the question of which of these variables has the strongest effect on depression, because we would be comparing 'apples' and 'oranges'. The answer is given by the standardized coefficients, often also called 'betas'. They tell us by how many standard deviations the dependent variable changes for a change in an independent variable of one standard deviation. Using this common yardstick, we easily see that the reported symptoms have the strongest effect on depression.
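
A sketch of this conversion, using the unstandardized coefficient for the symptom count together with the standard deviations from the descriptive statistics table:

    b_symptoms = .701        # unstandardized coefficient for the symptom count
    sd_symptoms = 4.42       # SD of the symptom count (descriptives table)
    sd_depression = 7.65     # SD of the depression score (descriptives table)

    beta_symptoms = b_symptoms * sd_symptoms / sd_depression
    print(round(beta_symptoms, 3))   # .405, matching the 'Beta' column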





(D) Hierarchical Regression Models:

In this section we briefly discuss the hierarchical ordering of (groups of) independent variables in multiple regressions. As long as we include the same independent variables in our regression model, it does not matter in which order we enter them into the equation: the regression or slope coefficients will be exactly the same. (They only change if we add or omit an independent variable, because the multiple regression procedure adjusts all estimates for the presence of the other variables in the equation.) However, even though the regression coefficients do not change as long as we have the same independent variables, the order in which they are entered does affect how the 'explained' variation is apportioned among (or attributed to) the individual independent variables. Except for the limiting case in which all correlations among the independent variables are zero, it is actually impossible to attribute variation in the dependent variable uniquely to each of the independent variables. When there is overlap (i.e., correlated independent variables), part of the explained variation in the dependent variable is explained jointly by two or more independent variables. In this case, regression analysis (as well as analysis of variance) always attributes the joint variation to the variables entered earlier into the equation. The result is that changing the order of entry changes the amount of variation attributed to the various independent variables.

The following two summary tables show this; they come from the same multiple regression model as before, only this time the variables are entered block-wise, or in groups, with age and sex entered first, followed by comorbid conditions and physical functioning, and finally by the symptom count. Concentrate on two columns, 'R Square' and 'R Square Change': they tell the essential story. In the first table, the model attributes an additional 12.9% of the variation in depression scores to comorbid conditions and physical functioning and another 12.5% to symptoms. Of course, altogether, all 5 variables (with the demographics included) account for 27.1% of the variation in depression scores. Now look at the second summary table. The only change made was that the symptom variable was entered BEFORE the comorbid conditions and physical functioning. Now, 23.3% of the variation in depression is attributed to symptoms and only an additional 2.2% to comorbid conditions and physical functioning. (A short sketch following the two tables recomputes these change statistics.) Clearly, the way in which one presents one's tables may sway the uninitiated reader in one direction or another. Just remember the most important result: except in the case of uncorrelated independent variables or factors (which usually occurs only in clinical trials, where factors are unrelated by design as a result of random assignment), it is NOT possible to attribute variation in the outcome variable uniquely to one or the other independent variable, and research reports using regression or analysis of variance models on observational data should generally NOT emphasize 'amounts of variation attributed to one or the other independent variable'.

Model Summary
Model      R   R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .128       .016                .014                       7.5990              .016      6.151     2   743            .002
2       .381       .145                .141                       7.0922              .129     55.988     2   741            .000
3       .520       .271                .266                       6.5565              .125    127.028     1   740            .000

a Predictors: (Constant), PAGE patient age (in years), PSEX2 patient sex (recoded)

b Predictors: (Constant), PAGE patient age (in years), PSEX2 patient sex (recoded), PCOMORBI Patient Physical Comorbity (Count) , MOSPFCU SF-36: Pt.Physical Functioning-(CU)

c Predictors: (Constant), PAGE patient age (in years), PSEX2 patient sex (recoded), PCOMORBI Patient Physical Comorbity (Count) , MOSPFCU SF-36: Pt.Physical Functioning-(CU), PSYMCNT count of all reported symptoms

Model Summary
Model      R   R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .128       .016                .014                       7.5990              .016      6.151     2   743            .002
2       .499       .249                .246                       6.6435              .233    230.090     1   742            .000
3       .520       .271                .266                       6.5565              .022     10.907     2   740            .000

a Predictors: (Constant), PAGE patient age (in years), PSEX2 patient sex (recoded)

b Predictors: (Constant), PAGE patient age (in years), PSEX2 patient sex (recoded), PSYMCNT count of all reported symptoms

c Predictors: (Constant), PAGE patient age (in years), PSEX2 patient sex (recoded), PSYMCNT count of all reported symptoms, PCOMORBI Patient Physical Comorbity (Count) , MOSPFCU SF-36: Pt.Physical Functioning-(CU)
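
As promised above, here is a small sketch that recomputes the change statistics for the last block of the first table from the R-squares of the nested models (small differences from the printed values come from the R-squares being rounded to three decimals):

    from scipy import stats

    r2_reduced = .145    # model 2: demographics plus comorbidity and physical functioning
    r2_full = .271       # model 3: the same predictors plus the symptom count
    m = 1                # number of predictors added in the final block
    n, k_full = 746, 5   # sample size and number of predictors in the full model

    r2_change = r2_full - r2_reduced
    f_change = (r2_change / m) / ((1 - r2_full) / (n - k_full - 1))
    p_change = stats.f.sf(f_change, m, n - k_full - 1)

    print(round(r2_change, 3), round(f_change, 1), p_change)   # roughly .13, 128, and a p-value far below .001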