1 Introduction

1.1 What will I learn:

  • Logistic Regression (duh)
  • Testing for main effects / interactions in GLMs
  • Interpretation of Odds-Ratios
  • Predicting data (\(\hat{y}\))

1.2 The data

This data set contains real data, published by Cowles and Davis (1987). It is freely available on the Internet. The following fields are available:

Variable        Description
subject number  All participants are numbered consecutively
neuroticism     Scale from the Eysenck personality inventory
extraversion    Scale from the Eysenck personality inventory
sex             Female or male
volunteer       Did the subject volunteer for a study?

1.3 The research question

What does the willingness to participate in a study depend on?

2 Analysis

  1. Load the data
library(ggplot2)
data = read.csv('cowles.csv')
  2. Investigate the structure of the data using summary(data).

    Hint: What kind of data ranges do you have? Do you have continuous variables? What are your dependent/independent variables?
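A minimal sketch of this step (assuming the data frame is called `data`, as above):

```r
# Overview of ranges, quartiles and variable types
summary(data)
str(data)  # shows which columns are numeric vs. factor/character
```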

  3. Clean: Change the response of volunteer from “yes”/“no” to 1/0. This is necessary for easier plotting. Note that there is a trade-off: sometimes we want binary data to be treated as categorical/factor data, sometimes we want numerical 0/1 data. Two possibilities: check whether the variable is equal to ‘yes’ (to convert to logical), then convert to numeric by multiplying by 1. Or: cast it to numeric and subtract one.

    Hint: Either use as.numeric(XX)-1 or (XX=='yes')*1
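A sketch of both options (the second only works if `volunteer` was read in as a factor):

```r
# Option 1: compare to 'yes' (gives TRUE/FALSE), multiply by 1 to get 0/1
data$volunteer = (data$volunteer == 'yes') * 1

# Option 2: if volunteer is a factor with levels "no" < "yes",
# as.numeric gives 1/2, so subtract one:
# data$volunteer = as.numeric(data$volunteer) - 1
```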

  4. Vis: Make plots using ggplot2: neuroticism vs. volunteer, extraversion vs. volunteer, sex vs. volunteer. Make use of stat_summary().

    Hint: ggplot(data,aes(x=??,y=??))+stat_summary()

  5. Fit: Fit a logistic model using glm with main effects only: \(\hat{y} = \beta_0 + \beta_1*male + \beta_2*extraversion + \beta_3*neuroticism\). Remember to change the family argument in the glm function accordingly.

    Hint: ?? = glm(data=??,formula=???,family=???)

  6. Check Dispersion Parameter: Use the formula from the slides to check whether the variance follows the mean as expected.
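One common estimate of the dispersion parameter is the sum of squared Pearson residuals divided by the residual degrees of freedom (a sketch, assuming the fitted model is stored in `model`; your slides may use a different but equivalent formula):

```r
# Should be close to 1 if the variance follows the binomial
# mean-variance relation; values far above 1 indicate overdispersion
phi = sum(residuals(model, type = 'pearson')^2) / df.residual(model)
phi
```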

  7. Check Residuals: Add the residuals to your data frame (data$residuals) and plot them against extraversion and neuroticism. Add geom_smooth and interpret the plot. Are there any non-linearities left to be modeled? What do the plots tell you about the model fit?
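A sketch for one of the two plots (the same pattern works for neuroticism):

```r
# residuals() on a glm object returns deviance residuals by default
data$residuals = residuals(model)
ggplot(data, aes(x = extraversion, y = residuals)) +
  geom_point() +
  geom_smooth()  # a flat smooth suggests no remaining non-linearity
```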

  8. Interpret: What is the estimated increase in odds-ratio between the two extreme ends of the extraversion scale? Does it depend on sex?

    Hint: Use min/max to find the range of extraversion. Think of the odds-ratio change for one step in X; which parameter represents that?
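A sketch, assuming the main-effects model from above is stored in `model`:

```r
# One unit of extraversion multiplies the odds by exp(beta);
# the full observed range multiplies them by exp(beta * range)
b = coef(model)['extraversion']
r = max(data$extraversion) - min(data$extraversion)
exp(b * r)
# In the main-effects model this factor is the same for both sexes;
# sex only shifts the intercept
```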

  9. Interpret/Viz: Plot actual data and predicted data into one plot. Use the predict function from R.
  • Start with predict(model) which will give you \(\hat{y}\). In what range are the outputs of predict?
  • Plot the predicted values against extraversion. Q: What is the meaning of the y-scale?

    Hint: It is useful to add the predicted values to your dataframe i.e. df$predicted = predict(model). This makes it easier to work with ggplot:

    ggplot(data=??,aes(x=??,y=??))+geom_point()
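Filling in the placeholders might look like this:

```r
# predict() on a glm returns the linear predictor (logit scale) by default
data$predicted = predict(model)
range(data$predicted)  # not restricted to [0, 1]
ggplot(data = data, aes(x = extraversion, y = predicted)) +
  geom_point()
```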

  • Convert the outputs to probabilities through the inv-logit function:

  invlogit = function(x)  1/(1 + exp(-x))
  • Plot it again.

    Hint: As before, add a new column to your data.frame, then use ggplot to visualize it.
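Putting the pieces together (assuming the `predicted` column from the previous step):

```r
invlogit = function(x) 1/(1 + exp(-x))
data$prob = invlogit(data$predicted)  # now in [0, 1]
ggplot(data, aes(x = extraversion, y = prob)) +
  geom_point()
```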

  • We can also predict what an ideal response would be, if all levels of the covariates and factors were measured equally (balanced). Use the predict(..., newdata=XXX) command. Apply the invlogit function only for plotting (you can do `aes(x=XXX, y=invlogit(predict))`).

df.predict = data.frame(expand.grid(neuroticism = seq(XXX),sex=c(XX,XX),extraversion=seq(XXX)))
df.predict$XXX = predict(XXX,XXX)
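One way to fill in the placeholders (the ranges and factor levels below are assumptions; adjust them to your data):

```r
# A balanced grid of all covariate combinations
df.predict = data.frame(expand.grid(neuroticism  = seq(0, 24),
                                    sex          = c('female', 'male'),
                                    extraversion = seq(0, 23)))
df.predict$predicted = predict(model, newdata = df.predict)
```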
  • Combine the predicted data frame with your original data frame. A convenient way is to use a function from the dplyr package:
    combined = dplyr::bind_rows(a,b,c,.id="type")
# Note: 'dplyr' is the successor to the 'plyr' package; they are partly incompatible. Don't load dplyr using "library(dplyr)".
# You can access single functions from a package using the `package::function` syntax.
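For example (`bind_rows` fills columns that exist in only one data frame with `NA`):

```r
# type == "1" marks rows from the original data, "2" the predictions
combined = dplyr::bind_rows(data, df.predict, .id = "type")
```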
  • Plot extraversion against volunteering and split the data by your group-id variable (type in the previous example). Why do the two types differ?

    Hint: `ggplot(data=XXX,aes(x=XX,y=XX,group=type))+stat_summary()`

  • We want to show that we can extrapolate from seen data: change the range of predicted values for extraversion to 0 to 35 (even though in this example the maximal extraversion score seems to be 23). Plot the predicted values again. This could of course also be used to interpolate between scores (i.e. predict the non-existing score of 10.5) by reducing the step size in seq.
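A sketch of the extrapolation (the neuroticism range and sex levels are assumptions, as before):

```r
invlogit = function(x) 1/(1 + exp(-x))
df.extra = data.frame(expand.grid(neuroticism  = seq(0, 24),
                                  sex          = c('female', 'male'),
                                  extraversion = seq(0, 35)))  # beyond the data
df.extra$predicted = predict(model, newdata = df.extra)
ggplot(df.extra, aes(x = extraversion, y = invlogit(predicted))) +
  stat_summary(fun = mean, geom = 'line')
```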

3 Interactions

  1. We will look at the interaction between neuroticism and extraversion. \(\hat{y} = \beta_0 + \beta_1*male + \beta_2*extraversion + \beta_3*neuroticism + \beta_4*extraversion*neuroticism\)
  • Fit the new model with the added interaction

    Hint: A+B+A:B == A*B. A:B is the interaction.
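A sketch of the interaction model (`model2` is a chosen name):

```r
# extraversion * neuroticism expands to both main effects
# plus their interaction
model2 = glm(volunteer ~ sex + extraversion * neuroticism,
             data = data, family = binomial)
summary(model2)
```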

  • Do a model comparison of the model with interaction to the one without (use a likelihood-ratio test or AIC/BIC). Which model explains the data better?

    Hint: anova(XX,test=XX)
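Both comparisons side by side (assuming `model` and `model2` from above):

```r
# Likelihood-ratio test for nested models
anova(model, model2, test = 'Chisq')
# Information criteria: lower is better
AIC(model, model2)
BIC(model, model2)
```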

  2. We want to visualize the interaction.
  • Use the previous df.predict where you used the expand.grid, but apply it to the new model with the interaction.
  • Make a plot of extraversion against the predicted values, but split it by color using aes(..., color=factor(neuroticism)).
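A sketch, reusing the balanced grid from before and assuming the interaction model is stored in `model2`:

```r
invlogit = function(x) 1/(1 + exp(-x))
df.predict$predicted = predict(model2, newdata = df.predict)
# One line per neuroticism level; stat_summary averages over sex
ggplot(df.predict, aes(x = extraversion, y = invlogit(predicted),
                       color = factor(neuroticism))) +
  stat_summary(fun = mean, geom = 'line')
```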

  • Who is the most likely candidate to volunteer? Who is the least likely one? What is the relative risk of being very high on extraversion and very low on neuroticism against very high on extraversion and very high on neuroticism?

  • If you have time, try to visually compare your fit against the actual data. This is somewhat difficult for two continuous predictors, but maybe you find a way! I recommend combining the non-linear smoothing function geom_smooth with facet_wrap.

This concludes the logistic regression exercise. Check back to the initial goal and see whether you learned something!