STAT 4/6C03:Final exam 2018

Rules

You may use any notes or books, but please do not use any web resources you have not previously downloaded. Please do not speak with any one other than me about the exam.
The data sets required are available in the HW directory on the class web site. As usual, please let me know ASAP if you find any problems with the data or with the exam.
The exam is due in the dropbox on Avenue to Learn by midnight on Friday 21 December (i.e. the end of Friday). Please submit your solutions as a single .Rnw or .R file. Your solution should include self-contained, working R code (points off for anything that doesn’t work when I try it!) and an explanation of what you did.

I. Contrasts and interactions

The file fakerice.csv has some (made-up) data on rice growth with four seed treatments (control, hydrochloric acid (HCl: inorganic), propionic (organic) and butyric (organic) acid).

Set up a custom contrast matrix where the parameters represent (a) control; (b) difference in growth rate between control and acid treatments; (c) difference in growth rate between inorganic and organic acid treatments; (d) difference between growth rate under the two organic acid treatments.
In a linear model growth~acid, show the manual calculations that give the estimated value of the four parameters defined in question 1.
Fit the model with the full interaction of acid and env, using the custom contrast matrix for acid and default contrasts for env. Explain in words what the main effects of acid=inorganic vs organic and env=dark vs light are estimating in this model.
Refit the model with sum-to-zero contrasts for env. Explain the meanings of inorganic-vs-organic and dark-vs-light in this case. What is the relationship between the interaction coefficients in this model and the previous model? Why?

II. Seatbelts data set

The file seatbelts.csv contains information on numbers of people killed or injured in road accidents in the UK (it’s a modified version of the ?Seatbelts data set from base R). The columns are date, year, month (self-explanatory); km (“distance driven”, must be in thousands? of kilometers); PetrolPrice (petrol=gasoline); law (whether or not a seatbelt law was in place); type (drivers, front-seat passengers, or back-seat passengers); kill_injure (number killed or seriously injured in accidents)

Using ggplot or base R graphics, plot (a) the number of deaths and injuries per km driven as a function of time, subdivided by type. Add a vertical line at the time the law came into effect (b) the number of deaths and injuries per km by month, subdivided by type. (You can do the plots any way you want, but I found the results for the monthly pattern were clearest if I plotted lines with each type in a separate facet and used aes(group=year); using a log10 scale for the y-axis helped too.)
Create a date variable date0 that measures time in years since the beginning of the data set; fit an appropriate GLM (remember these are count data) that predicts the number of deaths and injuries per kilometer driven using the main effects date0, law, cos(2*pi*month/12) and sin(2*pi*month/12) (to account for seasonal effects), and PetrolPrice, and their interactions with driver/passengers type.
Produce graphical diagnostic plots; what do you conclude?
Check for overdispersion and adjust/refit your model appropriately; explain what/why you did.
Evaluate the statistical significance of the interactions (at the level of the term, not the level of the individual parameters) - in particular, evaluate the effect of adding or removing cos(2*pi*month/12) and sin(2*pi*month/12) together, to test the seasonal effect, rather than testing cos and sin separately. You can do this any way you want (Wald, likelihood ratio test, bootstrap :-) ), but you should specify which one you did, and explain its benefits and limitations.
Evaluate the effect of law on the three driver/passenger types, including the confidence intervals of the parameters involving law. Don’t just evaluate significance; interpret the quantitative values of the parameter estimates.
Generate predicted values for your model, and confidence intervals on the predictions, and create a plot that shows the raw data along with predicted values and their confidence intervals.

III. Salamanders (zero-inflation)

The Salamanders data set (http://dx.doi.org/10.5061/dryad.5m8f6) gives the number of salamanders of different species counted in different streams. site=samling site; sample=sample number at a given site; count=number of salamanders; cover=(scaled) availability of refuge within the stream; mined=whether the site was affected by mountaintop mining or not; DOP=(scaled) days since precipitation; Wtemp=(scaled) water temperature; DOY=(scaled) day of year. spp is the salamander species; the sal_agg data set combines all species within a sample.

Use the aggregated data set for the rest of this question.

Create univariate plots of number of counts (response) as a function of each of the continuous covariates; use colour or shape to distinguish between samples from mined and non-mined sites.
Fit a negative binomial model (NB2, i.e. variance is a quadratic function of the mean) using main effects of all of the available covariates (DOP, Wtemp, DOY, mined).
Plot diagnostics, colouring the points by whether they correspond to mined sisites or not.
Generate 1000 predictive simulations from your fitted model; produce a plot showing the distribution of the number of zero-count observations in the ensemble of simulations. Compare the number of zero-count observations in the data and draw conclusions about the need (or not) for a zero-inflated model.
Using the glmmTMB package, fit regular and zero-inflated NB1 and NB2 models (assume the level of zero-inflation is independent of the covariates). Compute AIC for all four models and describe your conclusions about which model is best, why, and by how much. Repeat the predictive simulations and count-of-zeros plot for your best model.
What conclusions do you draw about the effect of mining on (aggregated) salamander counts?

IV. Mixed models

The Early data set is from the mlmRev package:

The 103 infants from low income African American families were divided into a treatment group (58 infants) and a control group (45 infants). Starting at 0.5 years of age the infants in the treatment group were exposed to an enriched environment. Each infant’s cognitive score on an age-specific, normalized scale was recorded at ages 1, 1.5, and 2 years.

The predictor variables are id (ID number), age, and trt (N/Y). The response is cog (cognitive score).

You may choose either lme4 or lmerTest to fit the mixed models in this example.

Graph the data in a way that shows all of the effects on cog, including distinguishing among individuals in some way.
Construct two modified versions of the age variable, one (fAge) a factor with associated sum-to-zero contrasts, and one (cAge) a zero-mean numeric variable (these are already in the file, although you may still need to convert fAge to a factor and set its contrasts). Fit three ordinary linear models (ignoring id, but including the interaction of age and treatment), one with each version of the age variable (original, factor, mean-centered). Describe and compare the three models, especially the statistical significance of the main effect of treatment. Use AIC to compare the three models and interpret the results (differences in both df and AIC.

Regardless of the conclusions you reached in the previous step, use the centered-mean (cAge) predictor in the remainder of this question.