Morgan, Leech, Gloeckner, & Barrett (2013). Correlation and Regression (Chapter 9).

Morgan, G. A., Leech, N. L., Gloeckner, G. W., & Barrett, K. C. (2013). Correlation and Regression (Chapter 9). In IBM SPSS for Introductory Statistics: Use and Interpretation (5th edition, pp. 149–170). New York: Routledge.

“In this chapter, you will learn how to compute several associational statistics, after you learn how to make scatterplots and how to interpret them. An assumption of the Pearson product moment correlation is that the variables are related in a linear (straight line) way so we will examine scatterplots to see if that assumption is reasonable. Second, the Pearson correlation and the Spearman rho will be computed. The Pearson correlation is used when you have two variables that are normal/scale, and the Spearman is used when one or both of the variables are ordinal. Third, you will compute a correlation matrix indicating the associations among all the pairs of three or more variables. Fourth, you will compute simple or bivariate regression, which is used when one wants to predict scores on a normal/scale dependent (outcome) variable from one normal or scale independent (predictor) variable. Last, we will provide an introduction to a complex associational statistic, multiple regression, which is used to predict a scale/normal dependent variable from two or more independent variables.” (p 149)

[“Assumptions and Conditions for the Pearson Correlation (r) and Bivariate Regression” (p 149) …]

  1. “The two variables have a linear relationship. …
  2. “Scores on one variable are normally distributed for each value of the other variable and vice versa. If degrees of freedom are greater than 25, failure to meet this assumption has little consequence. Statistics designed for normally distributed data are called parametric statistics. …
  3. “Outliers (i.e., extreme scores) can have a big effect on the correlation.” (p 149)

[“Assumptions and Conditions for Spearman Rho (rs)” (p 149) …]

  1. “Data on both variables are at least ordinal. Statistics designed for ordinal data and which do not assume normal distribution of data are called nonparametric statistics.
  2. “Scores on one variable are monotonically related to the other variable. This means that as the values of one variable increase, the other should also increase but not necessarily in a linear (straight line) fashion. The curve can flatten but cannot go both up and down as in a U or J.”

[“Problem 9.1: Scatterplots to Check the Assumption of Linearity” (p 150) …]

“A scatterplot is a plot or graph of two variables that shows how the score for an individual on one variable associates with his or her score on the other variable. If the correlation is _high positive_, the plotted points will be close to a straight line (the linear regression line) from the lower left corner of the plot to the upper right. The linear regression line will slope downward from the upper left to the lower right if the correlation is _high negative_. For correlations _near zero_, the regression line will be flat with many points far from the line, and the points form a pattern more like a circle or random blob than a line or oval.” (p 150)

“… it may show that a better fitting line would be a curve rather than a straight line. In this case the assumption of a linear relationship is violated and a Pearson correlation would not be the best choice. The Spearman or Kendall’s tau correlations would be better.” (p 150)
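[A minimal sketch in Python (hypothetical data, `scipy.stats` assumed available) of why a monotonic but curvilinear relationship favors Spearman over Pearson: …]

```python
import numpy as np
from scipy import stats

# Hypothetical data: y increases with x, but along a curve, not a line.
x = np.arange(1, 11, dtype=float)
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)      # assumes a linear relationship
spearman_rho, _ = stats.spearmanr(x, y)  # assumes only monotonicity

# Pearson is noticeably below 1 because the straight line misfits the
# curve; Spearman is exactly 1 because the rank orders agree perfectly.
```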

[“Problem 9.2: Bivariate Pearson and Spearman Correlations” (p 155) …]

“The Pearson product moment correlation is a bivariate parametric statistic used when both variables are approximately normally distributed (i.e., scale data). When you have ordinal data or when assumptions are markedly violated, one should use a nonparametric equivalent of the Pearson correlation coefficient. One such nonparametric, ordinal statistic is the Spearman rho (another is Kendall’s tau, …)” (p 155)

[“Interpretation of Output 9.2” (p 157) …]

“Note that the degrees of freedom (N – 2 for correlations) is put in parentheses after the statistic (r for Pearson correlation), which is usually rounded to two decimal places and is italicized, as are all statistical symbols using English letters.” (p 157)
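[A small Python sketch of the reporting convention described above (df = N − 2, two decimals, leading zero dropped); the sample size and r value are hypothetical: …]

```python
# Hypothetical sample size and Pearson r
n = 75
r = 0.3245

df = n - 2                       # degrees of freedom for a correlation
r_text = f"{r:.2f}".lstrip("0")  # two decimals, leading zero dropped
report = f"r({df}) = {r_text}"
print(report)  # r(73) = .32
```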

“The nonparametric Spearman correlation is based on ranking the scores (1st, 2nd, etc.) rather than using the actual raw scores. It should be used when the scores are ordinal data or when assumptions of the Pearson correlation (such as normality of the scores) are markedly violated. Note, you should _not_ report both the Pearson and Spearman correlations …” (p 157)

[“Example of How to Write about Problem 9.2” (p 158) …]

“Thus, the Spearman rho statistic was calculated, _r_(73) = .32, _p_ = .006. … The _r_² indicates that approximately 10% of the variance in math achievement test scores can be predicted from mother’s education.”

[… i.e., .32² = .1024 ≈ 10%]
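[The variance-explained arithmetic, checked in Python: …]

```python
rho = 0.32
r_squared = rho ** 2        # proportion of variance predictable
print(round(r_squared, 4))  # 0.1024, i.e., roughly 10%
```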

[“Problem 9.3: Correlation Matrix for Several Variables” (p 158) …]

“The Bonferroni correction is a conservative approach designed to keep the significance level at .05 for the whole study. Using Bonferroni, you would divide the usual significance level (.05) by the number of tests. In this case a _p_ < .008 (.05/6) would be required for statistical significance. Another approach is simply to set alpha (the _p_ value required for statistical significance) at a more conservative level, perhaps .01 instead of .05.” (p 159)
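[The Bonferroni arithmetic from the passage, in Python: …]

```python
alpha = 0.05
n_tests = 6  # number of correlations tested in the matrix
bonferroni_alpha = alpha / n_tests
print(round(bonferroni_alpha, 3))  # 0.008, the p < .008 cutoff in the text
```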

“One-tailed tests are only used if you have a clear directional hypothesis (e.g., there will be a positive correlation between the variables).” (p 160)

[“Problem 9.4: Bivariate or Simple Linear Regression” (p 160) …]

“Correlations do not indicate prediction of one variable from another; however, there are [page break] times when researchers wish to make such predictions. To do this, one needs to use bivariate regression (which is also called simple regression or simple linear regression). Assumptions and conditions for simple regression are similar to those for Pearson correlations; the variables should be approximately normally distributed and should have a linear relationship.” (p 160-161)
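[A minimal sketch of bivariate regression in Python rather than SPSS; the data are hypothetical and `scipy.stats.linregress` is assumed available: …]

```python
import numpy as np
from scipy import stats

# Hypothetical predictor (x) and outcome (y) scores for six cases
x = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([5.0, 8.0, 9.0, 13.0, 14.0, 18.0])

result = stats.linregress(x, y)
# result.intercept and result.slope define the prediction equation:
#   predicted y = intercept + slope * x
predicted = result.intercept + result.slope * x
```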

[Adapted from “Coefficients” table (p 162) …]


[Referring to data for “grades in h.s.” under “Unstandardized Coefficients,” column “B” …]

“This [2.142] is the regression coefficient, which is the slope of the best fit line or regression line. Note that it is not equal to the correlation coefficient. The standardized regression coefficient (.504) for simple regression is the correlation.” (p 162)

[“Interpretation of Output 9.4” (p 162) …]

“Model Summary table … Variables Entered/Removed table … ANOVA table … Coefficients [table] …” (p 162)

“The Unstandardized Coefficients give you a formula that you can use to predict the _y_ scores (dependent variable) from the _x_ scores (independent variable). Thus, if one did not have access to the real _y_ score, this formula would tell one the best way of estimating an individual’s _y_ score based on that individual’s _x_ score. For example, if we want to predict _math achievement_ for a similar group knowing only _grades in h.s._, we could use the regression equation to estimate an individual’s achievement score; predicted _math achievement_ = .40 + 2.14 x (the person’s _grades_ score). Thus, if a student has mostly Bs (i.e., a code of 6) for their grades, their predicted _math achievement_ score would be 13.24; _math achievement_ = .40 + 2.14 x 6.

“One should be cautious in doing this because we know (from the Model Summary table) that _grades in h.s._ only explains 24% of the variance in _math achievement_, so this would not yield a very accurate prediction.” (p 163)
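[The chapter’s prediction equation, checked in Python: …]

```python
# Regression equation from the text: predicted math achievement =
# .40 + 2.14 * (grades in h.s. score)
intercept = 0.40
slope = 2.14
grades = 6  # code for "mostly Bs"

predicted = round(intercept + slope * grades, 2)
print(predicted)  # 13.24, matching the worked example
```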

[“Problem 9.5: Multiple Regression” (p 163) …]

“The purpose of multiple regression is similar to bivariate regression, but with more predictor variables. Multiple regression attempts to predict a normal (i.e., scale) dependent variable from a combination of several normally distributed and/or dichotomous independent/predictor variables.” (p 163)
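[A minimal ordinary-least-squares sketch of multiple regression in Python/NumPy (simulated data; the coefficients and variable names are made up): …]

```python
import numpy as np

# Simulate two predictors and an outcome built from them plus noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=50)

# Design matrix: intercept column plus the predictors
X = np.column_stack([np.ones(50), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef recovers approximately [1.0, 2.0, -0.5]
```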

[“Assumptions and Conditions of Multiple Regression” (p 164) …]

“There are many assumptions to consider, but we will only focus on the major ones that are easily tested. These include the following: the relationship between each of the predictor variables and the dependent variable is linear, the errors are normally distributed, and the variance of the residuals (difference between actual and predicted scores) is constant. A condition that can be problematic is multicollinearity; it occurs when there are high intercorrelations among some set of the predictor variables. In other words, multicollinearity happens when two or more predictors are measuring overlapping or similar information.” (p 164)
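[A small Python sketch of spotting multicollinearity by inspecting predictor intercorrelations; the data are simulated, and treating a very high intercorrelation as a red flag is a common rule of thumb, not from the text: …]

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # nearly a copy of x1
x3 = rng.normal(size=100)                   # unrelated predictor

predictors = np.column_stack([x1, x2, x3])
corr = np.corrcoef(predictors, rowvar=False)
# corr[0, 1] is close to 1: x1 and x2 carry overlapping information,
# which is exactly the multicollinearity problem described above.
```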

[“Interpretation of Output 9.5” (p 168) …]

“Because several independent variables were used, a reduction of the number of variables might help us find an equation that explains more of the variance in the dependent variable, once the _R_² is adjusted. It is helpful to use the concept of parsimony with multiple regression and use the smallest number of predictors needed.” (p 168)
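[A Python sketch of why parsimony matters once R² is adjusted; the standard adjusted-R² formula is used with hypothetical numbers: …]

```python
def adjusted_r_squared(r2, n, k):
    """Standard adjusted R^2 for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: a 6-predictor model with a slightly higher raw R^2 can
# still lose to a 2-predictor model after the adjustment.
small_model = adjusted_r_squared(0.50, n=75, k=2)
big_model = adjusted_r_squared(0.51, n=75, k=6)
```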

