Chapter 12: Bivariate Regression
Chapter Objectives
When you finish this chapter you should be able to
calculate and test a correlation coefficient for significance.
explain the OLS method and use the formulas for the slope and intercept.
fit a simple regression on an Excel scatter plot.
perform regression by using Excel and another package such as MegaStat.
interpret confidence intervals for regression coefficients.
test hypotheses about the slope and intercept by using t tests.
find and interpret the coefficient of determination R² and standard error s_yx.
interpret the ANOVA table and use the F test for a regression.
distinguish between confidence and prediction intervals.
identify unusual residuals and high-leverage observations.
test the residuals for non-normality and heteroscedasticity.
explain the role of data conditioning and data transformations.
Quiz Yourself
True/False Questions
An inverse relationship between an independent variable x and a dependent variable y means that as x increases, y decreases, and vice versa.
The regression line ŷ = 2 + 3x has been fitted to the data points (4,11), (2,7), and (1,5). The sum of squares for error will be 10.0.
In a simple linear regression model, testing whether the slope of the population regression line could be zero is the same as testing whether or not the population coefficient of correlation equals zero.
If the coefficient of correlation is –0.81, then the percentage of the variation in y that is explained by the regression line is 81%.
Except for the values r = -1, 0, and 1, we cannot be specific in our interpretation of the coefficient of correlation r. However, when we square it we produce a more meaningful statistic.
Given that SSE = 84 and SSR = 358.12, the coefficient of correlation (also called the Pearson coefficient of correlation) must be 0.90.
The coefficient of determination is the coefficient of correlation squared. That is, R² = r².
The value of the sum of squares for regression SSR can never be larger than the value of total sum of squares SST.
In regression analysis, if the coefficient of determination is 1.0, then the coefficient of correlation must be 1.0.
Multiple Choice Questions
A simple linear regression generated a correlation coefficient of 0.01. This tells us that
A. SSR is almost zero.
B. SSE is almost zero.
C. the two variables barely relate to each other.
D. we shall reject the null at less than a 5% significance level.
What randomness exists in the linear regression model?
A. The randomness from the explanatory variables, the X's.
B. The randomness from what is unexplained, the error.
C. The randomness of the dependent variable, the Y's.
D. None of the above.
A regression analysis between sales (in $1000) and advertising (in $100) resulted in the following least squares line: Sales' = 75 + 6*(Advertising). This implies that if advertising is $800, sales will be
A. $4,875
B. $12,300
C. $123,000
D. $487,500
Two models were proposed for a simple regression of tree height on bark thickness, Model A: Height’ = 7.8*Bark + 37 and Model B: Height’ = 8*Bark + 35. Using the information and calculations below, which model is best?
Model A: Height’ = 7.8*Bark + 37
Tree ID   Height (feet)   Bark Thickness (millimeters)   Predicted Height   Error   Squared Error
1         150             15
2         175             18                             177.4              -2.4    5.76
3         225             21                             200.8              24.2    585.64
4         200             23                             216.4              -16.4   268.96
Model B: Height’ = 8*Bark + 35
Tree ID   Height (feet)   Bark Thickness (millimeters)   Predicted Height   Error   Squared Error
1         150             15                             155                -5      25
2         175             18                             179                -4      16
3         225             21                             203                22      484
4         200             23
A. Model A
B. Model B
C. The models are identical.
D. It is impossible to determine the best model.
The least squares line is the line guaranteed to
A. be the one line of all possible lines around which the smallest square can be drawn.
B. be the line of all possible lines that connects the most observations with the fewest turns.
C. be the line of all possible lines that has the smallest squared sum of the distance between observations and predictions.
D. be the line of all possible lines that has the smallest sum of squared distance between observations and predictions.
Wasiq plans on selling his home and wishes to come up with a simple way to determine the asking price he will advertise. The following is a partial Excel output for a regression of Price (in $K) on number of rooms in the house (Rooms). Use this output to answer the next eight questions.
SUMMARY OUTPUT
Regression Statistics
R Square
Adjusted R Square
Standard Error       9.97
Observations
ANOVA
             df    SS         MS    F    Significance F
Regression   1
Residual     18    1790.79
Total        19    7653.73
             Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    43.34                           3.66     0.002      18.47       68.21
Rooms        9.97           1.3              7.68     4.39E-07   7.24
What percent of the variability in the price of a home is explained by the variability in the number of rooms?
A. 23.4% B. 58.6% C. 76.6% D. 99.4%
The predicted value for the fourth observation in the data set used for this regression is $163.01. The observed value is $146.50. What is the residual?
A. -16.51 B. 16.51 C. 306.6 D. 308.51
How many homes were in Wasiq’s sample?
A. 1 B. 18 C. 19 D. 20
Wasiq will only use this simple method to determine the asking price for his home if the number of rooms is an important predictor for the price of a home. Given the output from his regression, what should Wasiq do?
A. Reject the null and don’t use the method.
B. Reject the null and use the method.
C. Fail to reject the null and don’t use the method.
D. Fail to reject the null and use the method.
What sign should be on the correlation coefficient for this regression?
A. Positive, because the correlation coefficient is squared.
B. Negative, because the standard error is small.
C. Positive, because the coefficient on Rooms is positive.
D. Negative, because the standard error is large.
What is the margin of error for a 95% confidence interval for the coefficient on Rooms from this regression?
A. 12.7 B. 7.24 C. 5.46 D. 2.73
Wasiq’s home has 7 rooms. According to the output, what should he advertise as the asking price for his home?
A. $313,350 B. $113,130 C. $69,790 D. $53,310
According to the regression output, how much more could Wasiq ask if his home had two more rooms?
A. $133,070 B. $19,940 C. $9,970 D. $43,300
Which of the following are ethical regression analysis behaviors?
A. Remove variables from the model because the regression shows them to be non-significant.
B. Remove observations from the data set which are accurate but are outliers.
C. Use the significance test results of a preliminary regression to develop a model, then re-run the regression to get reportable results.
D. Correct observations in the data set which are inaccurate outliers.
Solved Problems from Text
12.2 a. The scatter plot shows a positive correlation between hours worked and weekly pay.
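b.
Hours Worked (X)   Weekly Pay (Y)   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
10                 93               100         7056        840
15                 171              25          36          30
20                 204              0           729         0
20                 156              0           441         0
35                 261              225         7056        1260
20                 177              350         15318       2130
x̄                  ȳ                SSxx        SSyy        SSxy
r = SSxy / sqrt(SSxx × SSyy) = 2130 / sqrt(350 × 15318) = 0.9199. In Excel: =CORREL(array1,array2) = 0.919908324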
c. t.025 = TINV(0.05,3) = 3.182446305
d. t = r × sqrt((n − 2)/(1 − r²)) = 0.9199 × sqrt(3/(1 − 0.8462)) = 4.063 > 3.182. We reject the null hypothesis of zero correlation.
e. p-value = TDIST(4.063,3,2) = 0.026883859
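The worksheet arithmetic in parts b through d can also be checked outside Excel. The following is a minimal Python sketch using only the standard library. The variable names are illustrative, and the critical value 3.182 is taken from part c rather than recomputed.

```python
import math

# Data from problem 12.2: hours worked (x) and weekly pay (y)
hours = [10, 15, 20, 20, 35]
pay = [93, 171, 204, 156, 261]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(pay) / n

# Sums of squares and cross products, as in the part b worksheet
ss_xx = sum((x - x_bar) ** 2 for x in hours)
ss_yy = sum((y - y_bar) ** 2 for y in pay)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, pay))

# Sample correlation and its t statistic with n - 2 degrees of freedom
r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-tailed critical value from part c (alpha = .05, df = 3)
t_crit = 3.182
reject_h0 = abs(t_stat) > t_crit

print(ss_xx, ss_yy, ss_xy)
print(round(r, 4), round(t_stat, 3), reject_h0)
```

The sums should match the SSxx, SSyy, and SSxy row of the worksheet, and the decision should agree with part d.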
12.8 a. A $1 increase in price reduces expected sales by 37.5 units.
b. Sales = 842 – (20)*37.5 = 92
c. From a practical point of view, no. A zero price is unrealistic.
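As a quick check on parts a through c, the fitted equation can be wrapped in a small function. This is only a sketch: the function name predicted_sales is hypothetical, and the coefficients 842 and 37.5 are the ones used in part b.

```python
def predicted_sales(price, intercept=842.0, slope=-37.5):
    """Return expected unit sales from the fitted line Sales = 842 - 37.5 * Price."""
    return intercept + slope * price


# Part b: expected sales at a price of $20
sales_at_20 = predicted_sales(20)

# Part a: a $1 price increase changes expected sales by the slope
change_per_dollar = predicted_sales(21) - predicted_sales(20)

# Part c: the intercept is the prediction at a price of zero,
# which is valid arithmetic but not a realistic price
sales_at_zero = predicted_sales(0)

print(sales_at_20, change_per_dollar, sales_at_zero)
```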
12.10 a. A $1 million increase in revenue raises expected net income by $30,700.
b. If revenue is zero, the model predicts net income of 2,277 million dollars. This suggests that the firm has net income when revenue is zero, which does not seem meaningful.
c. Net income = 2277 + .0307*(1000) = 2307.7 million dollars
12.16 a.
Hours Worked (X)   Weekly Pay (Y)   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
10                 93               100         7056        840
15                 171              25          36          30
20                 204              0           729         0
20                 156              0           441         0
35                 261              225         7056        1260
20                 177              350         15318       2130
x̄                  ȳ                SSxx        SSyy        SSxy
b. b1 = SSxy/SSxx = 2130/350 = 6.086, b0 = ȳ − b1·x̄ = 177 − (6.0857)(20) = 55.286, so ŷ = 55.286 + 6.086X
c.
Hours Worked (xi)   Weekly Pay (yi)   Estimated Pay (ŷi)   yi − ŷi   (yi − ŷi)²   (ŷi − ȳ)²   (yi − ȳ)²
10                  93                116.146              -23.146   535.7373     3703.209    7056
15                  171               146.576              24.424    596.5318     925.6198    36
20                  204               177.006              26.994    728.676      3.6E-05     729
20                  156               177.006              -21.006   441.252      3.6E-05     441
35                  261               268.296              -7.296    53.23162     8334.96     7056
20                  177               177.006              -0.006    3.6E-05      3.6E-05     0
20                  177                                              2355.429     12963.79    15318
x̄                   ȳ                                                SSE          SSR         SST
d.
e.
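The calculations in parts b and c can be reproduced with a short Python sketch, again using only the standard library. The variable names are illustrative. The last two quantities, R² = SSR/SST and s_yx = sqrt(SSE/(n − 2)), are the standard definitions and are included here as an assumption about what one would compute next from these sums. Because the sketch carries the unrounded slope and intercept, its SSR differs slightly from the worksheet value, which was built from the rounded coefficients 55.286 and 6.086.

```python
import math

# Data from problem 12.16: hours worked (x) and weekly pay (y)
hours = [10, 15, 20, 20, 35]
pay = [93, 171, 204, 156, 261]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(pay) / n

ss_xx = sum((x - x_bar) ** 2 for x in hours)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, pay))

# OLS slope and intercept (part b)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Fitted values and the three sums of squares (part c)
fitted = [b0 + b1 * x for x in hours]
sse = sum((y - f) ** 2 for y, f in zip(pay, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)
sst = sum((y - y_bar) ** 2 for y in pay)

# Standard follow-on measures of fit (assumed, not taken from the worksheet)
r_squared = ssr / sst
s_yx = math.sqrt(sse / (n - 2))

print(round(b0, 3), round(b1, 3))
print(round(sse, 3), round(ssr, 3), round(sst, 3))
print(round(r_squared, 4), round(s_yx, 2))
```

With unrounded coefficients, SSE + SSR equals SST up to floating-point error, which is a useful check on the worksheet.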
12.22 a. Y = 7.6425 + 0.9467*X
b. The 95% confidence interval is 0.9467 ± 2.145(0.0936) or (0.7460, 1.1473).
c. H0: β1 ≤ 0 versus H1: β1 > 0.
tcr = TINV(0.10,14) = 1.761. Reject the null hypothesis if t > 1.761. Here t = 10.118, so we reject the null hypothesis.
d. p-value = TDIST(10.11836,14,1) = 4.03644E-08 ≈ 0.000, so we reject the null hypothesis. The slope is positive.
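The interval in part b and the test in part c follow directly from the slope, its standard error, and the two critical values. A minimal Python sketch, assuming the rounded inputs printed above (so the last digit can differ slightly from the Excel figures, which use unrounded values):

```python
# Inputs copied from parts a through c of problem 12.22
b1 = 0.9467            # estimated slope
se_b1 = 0.0936         # standard error of the slope
t_two_tail = 2.145     # t.025 with df = 14, used for the 95% interval
t_one_tail = 1.761     # t.05 with df = 14, used for the right-tailed test

# 95% confidence interval for the slope: b1 +/- t * se
margin = t_two_tail * se_b1
ci_lower = b1 - margin
ci_upper = b1 + margin

# Right-tailed test of H0: slope <= 0
t_stat = b1 / se_b1
reject_h0 = t_stat > t_one_tail

print(round(ci_lower, 4), round(ci_upper, 4))
print(round(t_stat, 3), reject_h0)
```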
12.24 a. Y = 614.930 − 109.112*X
b. Intercept: t = 614.930/51.2343 = 12.002. Slope: t = −109.112/51.3623 = −2.124.
c. df = 18, t.025 = 2.101. (Excel: =TINV(0.05,18) = 2.1009)
d. Intercept: p-value =TDIST(12.002,18,2) = 5.03299E-10, Slope: p-value =TDIST(ABS(-2.124),18,2) = 0.047785001
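e. (−2.124)² = 4.51, which is the F statistic (in a bivariate regression, F equals the square of the slope's t statistic).
f. This model has a poor fit. The F statistic is barely significant at a level of .05 and R² = .20. Only 20% of the variation in units sold can be explained by average price.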
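The t statistics in part b and the squared value in part e can be verified with a few lines of Python. This is a sketch under the assumption that the coefficients and standard errors are exactly those printed in part b. The critical value 2.101 is taken from part c.

```python
# Coefficients and standard errors from part b of problem 12.24
b0, se_b0 = 614.930, 51.2343
b1, se_b1 = -109.112, 51.3623

# Two-tailed critical value from part c (alpha = .05, df = 18)
t_crit = 2.101

# t statistic = coefficient / standard error
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

# In a bivariate regression the F statistic equals the squared slope t statistic
f_from_t = t_b1 ** 2

slope_significant = abs(t_b1) > t_crit

print(round(t_b0, 3), round(t_b1, 3), round(f_from_t, 2), slope_significant)
```

The slope's |t| exceeds 2.101 only narrowly, which is consistent with the "barely significant" reading in part f.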
Quiz Yourself Answers
True/False          Multiple Choice
1  T    6  F        1  A    6   C    11  D
2  T    7  T        2  B    7   A    12  B
3  T    8  T        3  C    8   D    13  B
4  F    9  F        4  A    9   B    14  D
5  T                5  D    10  C