# UNIVERSITY OF TORONTO STA 302 H1F / 1001 H1F EXAMINATIONS

Question

1. Expenditure on the criminal justice system is an area of continually rising cost in

the U.S. This question examines the relationship between the total number of police

employed in an American state, and the total spending on the state’s criminal justice

system (in millions of dollars US) for the 50 American states. The base 10 logarithm

of each variable was taken before fitting the model. The variables are named logpol

and logexp. Some output from SAS is given below. Seven of the numbers have been

replaced by upper case letters.

(a) (7 marks) Find the 7 missing values (A through G) in the SAS output.

(b) (3 marks) Construct simultaneous 90% confidence intervals for the slope and

intercept of the regression line.

(c) (4 marks) Carry out an hypothesis test to determine whether or not the data

give evidence that the coefficient of logexp is greater than 1.

(d) (5 marks) The District of Columbia (not one of the 50 states but a separate

region of the U.S.) spends 1,217,000,000 (1,217 million) dollars on its criminal

justice system. Predict how many police officers it has. Construct a 99% interval

for your value. Express your answer as a count of the number of police.

(e) (2 marks) On the next page there is a scatterplot of the logged data and a plot

of the residuals versus the predicted values for the fitted regression above. Using

information from the plots, give two reasons why you may not trust your prediction

in (d).

(f) (8 marks) The points for the states Alaska and Texas are identified in both plots

on the previous page. For each of these states, indicate what would happen to the

numbers below if the state was removed from the analysis. If a number changes,

indicate how. The numbers you should consider are:

Parameter Estimates, RootMSE, R-Square
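For part (d), the model is fitted on the base 10 log scale, so an interval is built on that scale and then back-transformed to a count of police. The sketch below shows only the arithmetic of a standard prediction interval for a new observation in simple linear regression. The function names are hypothetical, and every number in the usage lines is a placeholder rather than a value from the SAS output.

```python
import math

def prediction_interval_log10(b0, b1, x0, t_crit, root_mse, n, x_bar, sxx):
    """Prediction interval for a new response at x0, on the log10 scale."""
    y_hat = b0 + b1 * x0
    # Standard error for predicting a single new observation
    se_pred = root_mse * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - t_crit * se_pred, y_hat, y_hat + t_crit * se_pred

def to_original_scale(values):
    """Undo the base 10 logarithm."""
    return tuple(10 ** v for v in values)

# Expenditure is recorded in millions of dollars, so x0 = log10(1217).
x0 = math.log10(1217)

# Placeholder inputs (NOT from the exam output); replace with the SAS values.
log_interval = prediction_interval_log10(
    b0=0.5, b1=1.0, x0=x0, t_crit=2.68, root_mse=0.1, n=50, x_bar=3.0, sxx=10.0
)
lower_count, predicted_count, upper_count = to_original_scale(log_interval)
```

Because the back-transformation is monotone, the endpoints on the log scale map directly to endpoints on the count scale, although the resulting interval is no longer symmetric about the prediction.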

2. For the multiple linear regression model $Y = X\beta + \varepsilon$, the least squares estimates are

$b = (X'X)^{-1}X'Y$ and the residuals are $e = Y - Xb$. Assume the Gauss-Markov

conditions hold.

(a) (2 marks) Show that $b$ is an unbiased estimator of $\beta$.

(b) (5 marks) Show $\mathrm{Cov}(b) = \sigma^2 (X'X)^{-1}$.

(c) (5 marks) Show $e = (I - H)\varepsilon$ and $\mathrm{Cov}(e) = (I - H)\sigma^2$, where $H = X(X'X)^{-1}X'$.

(d) (3 marks) Find $\mathrm{Cov}(\hat{Y})$ in terms of the matrix $H$, where $\hat{Y} = Xb$.
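The results in Question 2 rest on the hat matrix being symmetric and idempotent. As a numerical sanity check (not a proof), the sketch below builds the hat matrix for a small design with an intercept and one predictor and verifies those properties exactly using rational arithmetic. The helper names and the four predictor values are hypothetical.

```python
from fractions import Fraction

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def hat_matrix(x_values):
    # Design matrix: a column of ones (intercept) and the predictor
    X = [[Fraction(1), Fraction(x)] for x in x_values]
    Xt = transpose(X)
    return matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)

H = hat_matrix([1, 2, 4, 7])  # hypothetical predictor values
n = len(H)
I_minus_H = [[(1 if i == j else 0) - H[i][j] for j in range(n)] for i in range(n)]

is_symmetric = H == transpose(H)
is_idempotent = matmul(H, H) == H
i_minus_h_idempotent = matmul(I_minus_H, I_minus_H) == I_minus_H
trace_h = sum(H[i][i] for i in range(n))  # equals the number of columns of X
```

Exact fractions are used here so that the equality checks are not affected by floating-point rounding.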

3. The data for this question are the proportion of male births (variable name: pmale)

in Canada and the United States for the years 1970 through 1990 (variable name:

year). Regressions were carried out separately for the two countries. Output from

SAS is given below. Questions begin on the next page.

(a) (4 marks) Are the proportions of male births on the decline in Canada and the

U.S.? What can you conclude from these regressions to answer this question?

(b) (3 marks) Explain why the United States has the larger t statistic for the test

of $H_0: \beta_1 = 0$ even though its slope is closer to zero. Give both a statistical

explanation and suggest a practical reason why this happened.

(c) (4 marks) Give an equation for a single linear model from whose fit both of the

regression equations from the output above can be obtained. Be sure to define

all of your variables and explain how to test whether the change in the male birth

rate differs between the two countries.

(d) (4 marks) On the next page there are residual plots for the regression of pmale

on year for Canada. What additional information about the data is provided by

these plots? How does this affect your answer to part (a) for Canada?
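One common way to combine two separate straight-line fits, as part (c) asks, is a single model with an indicator variable and its interaction with the predictor. The sketch below illustrates this on made-up toy numbers (not the birth data): it fits each group separately, assembles the combined coefficients, and confirms they satisfy the normal equations of the combined model. All names and data are hypothetical.

```python
from fractions import Fraction as F

def simple_ols(x, y):
    """Intercept and slope of a simple least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up toy data for two groups observed at the same times
t = [F(v) for v in (0, 1, 2, 3, 4)]
y_a = [F(v) for v in (10, 9, 9, 7, 6)]
y_b = [F(v) for v in (12, 12, 11, 11, 10)]

a0, a1 = simple_ols(t, y_a)
b0, b1 = simple_ols(t, y_b)

# Combined model: y = g0 + g1*t + g2*d + g3*(d*t), with d = 1 for group B
gamma = [a0, a1, b0 - a0, b1 - a1]

rows = [(F(1), ti, F(0), F(0), yi) for ti, yi in zip(t, y_a)]
rows += [(F(1), ti, F(1), ti, yi) for ti, yi in zip(t, y_b)]

residuals = [r[4] - sum(g * v for g, v in zip(gamma, r[:4])) for r in rows]
# Least squares requires the residuals to be orthogonal to every column
normal_eqs = [sum(r[j] * e for r, e in zip(rows, residuals)) for j in range(4)]
```

In this parameterization the interaction coefficient `gamma[3]` is the difference between the two slopes, so a t test of that coefficient being zero compares the rates of change in the two groups, under the added assumption of a common error variance.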

4. An experiment was carried out with the goal of constructing a model for total oxygen

demand in dairy wastes as a function of five laboratory measurements. Data were

collected on samples kept in suspension in water in a laboratory for 220 days. Although

all observations were taken on the same sample over time, assume that they are

independent. The measured variables are:

Y log of oxygen demand (demand measured in mg oxygen per minute)

X1 biological oxygen demand, mg/liter

X2 total Kjeldahl nitrogen, mg/liter

X3 total solids, mg/liter

X4 total volatile solids, mg/liter

X5 chemical oxygen demand, mg/liter

(a) (1 mark) Test the hypothesis that the coefficient of x3 is zero using the output

from the first regression.

(b) (1 mark) Test the hypothesis that the coefficient of x3 is zero using the output

from the second regression.

(c) (2 marks) Explain why there is a difference in your answers to parts (a) and (b).

(d) (3 marks) State the null and alternative hypotheses for the Analysis of Variance

F test for the first regression. What conclusion do you draw from its p-value?

(For the conclusion, do not say whether or not you reject the null hypothesis but

rather say what the test tells you about the linear model.)

(e) (3 marks) Use the output from the first regression to test the joint hypothesis

$\beta_4 = 0$, $\beta_5 = 0$.

(f) (4 marks) Which model do you prefer? Justify your choice.

(g) (2 marks) What residual plots would you like to see to check whether it is reasonable

to “treat the observations as independent”?
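A joint hypothesis such as the one in part (e) is commonly tested with a partial F statistic that compares the error sums of squares of the full and reduced models. The sketch below shows the computation only; the function name is hypothetical and the numbers in the usage lines are placeholders, not values from the exam output.

```python
def partial_f(sse_reduced, sse_full, df_reduced, df_full):
    """Partial F statistic for comparing nested regression models.

    sse_* are error sums of squares; df_* are error degrees of freedom.
    """
    extra_ss = sse_reduced - sse_full   # reduction in SSE from the extra terms
    extra_df = df_reduced - df_full     # number of coefficients being tested
    return (extra_ss / extra_df) / (sse_full / df_full)

# Placeholder values (NOT from the SAS output): two coefficients tested
f_stat = partial_f(sse_reduced=12.0, sse_full=10.0, df_reduced=16, df_full=14)
```

The statistic is compared with an F distribution on `extra_df` and `df_full` degrees of freedom.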

5. For each of the following questions, give brief answers (one or two sentences). Answers

without explanation will not receive any marks.

(a) (2 marks) In a simple regression of weight on height for a sample of adult males,

the estimated intercept is 5 kg. Interpret this value for someone who has not

taken any statistics courses.

(b) (2 marks) In simple linear regression, why can an $R^2$ value close to 1 not be used

as evidence that the model is appropriate?

(c) (2 marks) Suppose that the variance of the estimated slope in the simple regression

of Y on X1 is 10. Suppose that X2 is added to the model, and that X2 is

uncorrelated with X1. Will the variance of the coefficient of X1 still be 10?

(d) (2 marks) A regression analysis was carried out with response variable sales of

a product (Y) and two predictor variables: the amount spent on print advertisements

(X1) and the amount spent on television advertisements (X2) for the

product. The fitted equation was $\hat{Y} = -2.35 + 2.36X_1 + 4.18X_2 - 0.35X_1X_2$. The

test for whether the coefficient of the interaction term is zero had p-value less

than 0.0001. Explain what this test means in practical terms for the company

executive who has never studied statistics.

(e) (2 marks) Explain why we might prefer to use adjusted $R^2$ rather than $R^2$ when comparing two models.

6. (5 marks) A large real estate firm in Toronto has been keeping records on selling prices

for single family dwellings. They have also recorded numerous other features of the

houses that sold, including square footage, number of rooms, property taxes, type of

heating, lot size, area of city, existence of finished basement, etc. An agent for this

firm hopes to use these data to show that the rate of change in house prices over the

past seven years differs depending on area of the city. Describe how you would help

the agent.
