UNIVERSITY OF TORONTO STA 302 H1F / 1001 H1F EXAMINATIONS

August 30, 2017

Question
1. Expenditures on the criminal justice system are an area of continually rising cost in
the U.S. This question examines the relationship between the total number of police
employed in an American state, and the total spending on the state’s criminal justice
system (in millions of dollars US) for the 50 American states. The base 10 logarithm
of each variable was taken before fitting the model. The variables are named logpol
and logexp. Some output from SAS is given below. Seven of the numbers have been
replaced by upper case letters.

(a) (7 marks) Find the 7 missing values (A through G) in the SAS output.
(b) (3 marks) Construct simultaneous 90% confidence intervals for the slope and
intercept of the regression line.
(c) (4 marks) Carry out a hypothesis test to determine whether or not the data
give evidence that the coefficient of logexp is greater than 1.
(d) (5 marks) The District of Columbia (not one of the 50 states but a separate
region of the U.S.) spends 1,217,000,000 (1,217 million) dollars on its criminal
justice system. Predict how many police officers it has. Construct a 99% interval
for your value. Express your answer as a count of the number of police.
(e) (2 marks) On the next page there is a scatterplot of the logged data and a plot
of the residuals versus the predicted values for the fitted regression above. Using
information from the plots, give two reasons why you may not trust your prediction
in (d).
(f) (8 marks) The points for the states Alaska and Texas are identified in both plots
on the previous page. For each of these states, indicate what would happen to the
numbers below if the state was removed from the analysis. If a number changes,
indicate how. The numbers you should consider are:
Parameter Estimates, RootMSE, R-Square
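
Part (d) asks for an answer on the count scale even though the model is fitted to base 10 logarithms. The following is a minimal sketch of that back-transformation, assuming the intercept, slope, standard error of prediction and t critical value are read from the SAS output. The function name is hypothetical and the numbers passed in are placeholders, not values from this exam.

```python
import math

def back_transform_prediction(b0, b1, spending_millions, se_pred, t_crit):
    """Prediction interval for a police count from a log10-log10 fit.

    All arguments except spending_millions are hypothetical inputs that
    would be taken from the regression output.
    """
    x0 = math.log10(spending_millions)      # predictor on the log10 scale
    log_fit = b0 + b1 * x0                  # predicted log10(police)
    lower = log_fit - t_crit * se_pred      # interval on the log10 scale
    upper = log_fit + t_crit * se_pred
    # Undo the base 10 logarithm so everything is expressed as counts.
    return 10 ** log_fit, 10 ** lower, 10 ** upper

# Placeholder numbers for illustration only.
point, lo, hi = back_transform_prediction(1.0, 0.8, 1217, 0.1, 2.68)
print(round(point), round(lo), round(hi))
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric about the point prediction on the count scale.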
2. For the multiple linear regression model $Y = X\beta + \epsilon$, the least squares estimates are
$b = (X'X)^{-1}X'Y$ and the residuals are $e = Y - Xb$. Assume the Gauss-Markov
conditions hold.
(a) (2 marks) Show that $b$ is an unbiased estimator of $\beta$.
(b) (5 marks) Show $\mathrm{Cov}(b) = \sigma^2 (X'X)^{-1}$.
(c) (5 marks) Show $e = (I - H)\epsilon$ and $\mathrm{Cov}(e) = (I - H)\sigma^2$, where $H = X(X'X)^{-1}X'$.
(d) (3 marks) Find $\mathrm{Cov}(\hat{Y})$ in terms of the matrix $H$, where $\hat{Y} = Xb$.
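
The results in parts (b) to (d) rest on the hat matrix being symmetric and idempotent. The following is a small numerical check in Python with NumPy, using a simulated design matrix (an assumption for illustration only). It does not replace the algebraic argument the question asks for.

```python
import numpy as np

rng = np.random.default_rng(0)           # simulated design, not exam data
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                         # the matrix I - H

is_symmetric = np.allclose(H, H.T)        # H' = H
is_idempotent = np.allclose(H @ H, H)     # HH = H
annihilates_X = np.allclose(M @ X, 0)     # (I - H)X = 0
trace_H = np.trace(H)                     # equals the number of columns of X

print(is_symmetric, is_idempotent, annihilates_X, round(trace_H, 6))
```

Since (I - H)X = 0, applying I - H to Y = Xβ + ε leaves only (I - H)ε, which is the identity in part (c).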
3. The data for this question are the proportion of male births (variable name: pmale)
in Canada and the United States for the years 1970 through 1990 (variable name:
year). Regressions were carried out separately for the two countries. Output from
SAS is given below. Questions begin on the next page.
(a) (4 marks) Are the proportions of male births on the decline in Canada and the
U.S.? What can you conclude from these regressions to answer this question?
(b) (3 marks) Explain why the United States has the larger t statistic for the test
of $H_0: \beta_1 = 0$ even though its slope is closer to zero. Give both a statistical
explanation and suggest a practical reason why this happened.
(c) (4 marks) Give an equation for a single linear model from whose fit both of the
regression equations from the output above can be obtained. Be sure to define
all of your variables and explain how to test whether the change in the male birth
rate differs between the two countries.
(d) (4 marks) On the next page there are residual plots for the regression of pmale
on year for Canada. What additional information about the data is provided by
these plots? How does this affect your answer to part (a) for Canada?
4. An experiment was carried out with the goal of constructing a model for total oxygen
demand in dairy wastes as a function of five laboratory measurements. Data were
collected on samples kept in suspension in water in a laboratory for 220 days. Although
all observations were taken on the same sample over time, assume that they are
independent. The measured variables are:
Y log of oxygen demand (demand measured in mg oxygen per minute)
X1 biological oxygen demand, mg/liter
X2 total Kjeldahl nitrogen, mg/liter
X3 total solids, mg/liter
X4 total volatile solids, mg/liter
X5 chemical oxygen demand, mg/liter

(a) (1 mark) Test the hypothesis that the coefficient of X3 is zero using the output
from the first regression.
(b) (1 mark) Test the hypothesis that the coefficient of X3 is zero using the output
from the second regression.
(c) (2 marks) Explain why there is a difference in your answers to parts (a) and (b).
(d) (3 marks) State the null and alternative hypotheses for the Analysis of Variance
F test for the first regression. What conclusion do you draw from its p-value?
(For the conclusion, do not say whether or not you reject the null hypothesis but
rather say what the test tells you about the linear model.)
(e) (3 marks) Use the output from the first regression to test the joint hypothesis
$\beta_4 = 0$, $\beta_5 = 0$.
(f) (4 marks) Which model do you prefer? Justify your choice.
(g) (2 marks) What residual plots would you like to see to check whether it is reasonable
to “treat the observations as independent”?
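
One common route to a joint test such as the one in part (e) is a partial F test comparing the full model with the reduced model that omits X4 and X5. The following is a minimal sketch of the arithmetic, assuming the error sums of squares and degrees of freedom are available. The SAS output is not reproduced here, so the function name is hypothetical and the numbers are placeholders, not values from this exam.

```python
def partial_f(sse_reduced, sse_full, q, df_error_full):
    """F statistic for dropping q predictors from the full model."""
    numerator = (sse_reduced - sse_full) / q   # extra SS per dropped term
    denominator = sse_full / df_error_full     # MSE of the full model
    return numerator / denominator

# Placeholder sums of squares for illustration only.
f_stat = partial_f(sse_reduced=12.0, sse_full=10.0, q=2, df_error_full=14)
print(f_stat)
```

The statistic would be compared with an F distribution on q and df_error_full degrees of freedom.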
5. For each of the following questions, give brief answers (one or two sentences). Answers
without explanation will not receive any marks.
(a) (2 marks) In a simple regression of weight on height for a sample of adult males,
the estimated intercept is 5 kg. Interpret this value for someone who has not
taken any statistics courses.
(b) (2 marks) In simple linear regression, why can an $R^2$ value close to 1 not be used
as evidence that the model is appropriate?
(c) (2 marks) Suppose that the variance of the estimated slope in the simple regression
of Y on X1 is 10. Suppose that X2 is added to the model, and that X2 is
uncorrelated with X1. Will the variance of the coefficient of X1 still be 10?
(d) (2 marks) A regression analysis was carried out with response variable sales of
a product (Y ) and two predictor variables: the amount spent on print advertisements
(X1) and the amount spent on television advertisements (X2) for the
product. The fitted equation was $\hat{Y} = -2.35 + 2.36X_1 + 4.18X_2 - 0.35X_1X_2$. The
test for whether the coefficient of the interaction term is zero had p-value less
than 0.0001. Explain what this test means in practical terms for the company
executive who has never studied statistics.
(e) (2 marks) Explain why we might prefer to use adjusted $R^2$ rather than $R^2$
when comparing two models.
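
For part (e), it can help to recall how adjusted R-squared is computed. The following is a minimal sketch using the usual formula 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The function name is hypothetical and the numbers are placeholders chosen for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Placeholder values: the larger model has a slightly higher R^2.
small_model = adjusted_r2(r2=0.800, n=20, k=2)
large_model = adjusted_r2(r2=0.805, n=20, k=5)
print(small_model, large_model)
```

In this illustration the larger model has the higher R-squared but the lower adjusted value, because the extra predictors use up degrees of freedom.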
6. (5 marks) A large real estate firm in Toronto has been keeping records on selling prices
for single family dwellings. They have also recorded numerous other features of the
houses that sold, including square footage, number of rooms, property taxes, type of
heating, lot size, area of city, existence of finished basement, etc. An agent for this
firm hopes to use these data to show that the rate of change in house prices over the
past seven years differs depending on area of the city. Describe how you would help
the agent.
