This data set contains real data, published by Cowles and Davis (1987). It is freely available on the Internet. The following fields are available:
| Variable | Description |
|---|---|
| subject number | All participants are numbered consecutively |
| neuroticism | Scale from the Eysenck personality inventory |
| extraversion | Scale from the Eysenck personality inventory |
| sex | Female or male |
| volunteer | Did the subject volunteer in a study? |
What does the willingness to participate in a study depend on?
```r
library(ggplot2)
library(dplyr)
data = read.csv('cowles.csv')
```
Investigate the structure of the data using `summary(data)`.
Clean: Change the response of volunteer from “yes”/“no” to 1/0. This is necessary for easier plotting. Note that there is a trade-off: sometimes we want binary data to be treated as categorical/factor data, sometimes we want numerical 0/1 data. Two possibilities: check whether the variable is equal to ‘yes’ (to convert it to logical), then convert it to numeric by multiplying by 1. Or: cast it to numeric and subtract one.
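A minimal sketch of both options (assuming the column is read in as character or factor):

```r
# Option 1: compare to "yes" (logical), then multiply by 1 to get 0/1
data$volunteer = (data$volunteer == "yes") * 1
# Option 2: cast the factor to numeric (levels become 1/2) and subtract one
# data$volunteer = as.numeric(factor(data$volunteer)) - 1
```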
Vis: Make plots using ggplot2: neuroticism vs. volunteer, extraversion vs. volunteer, sex vs. volunteer. Make use of `stat_summary()`.
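For example (a minimal sketch; with a 0/1 outcome, `fun = mean` gives the proportion of volunteers at each score):

```r
ggplot(data, aes(x = extraversion, y = volunteer)) +
  stat_summary(fun = mean, geom = "point")
```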
Fit: Fit a logistic model using `glm` with main effects only: \(\hat{y} = \beta_0 + \beta_1 \cdot \text{male} + \beta_2 \cdot \text{extraversion} + \beta_3 \cdot \text{neuroticism}\). Remember to change the `family` argument of `glm` accordingly.
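A minimal sketch (the model name `model` is an assumption reused in later steps):

```r
model = glm(volunteer ~ sex + extraversion + neuroticism,
            data = data, family = binomial)
summary(model)
```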
Check Dispersion Parameter: Use the formula from the slides to check whether the variance follows the mean as expected.
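The slides are not reproduced here; one common estimate is the Pearson statistic divided by the residual degrees of freedom:

```r
# Values close to 1 indicate the variance follows the mean as assumed
sum(residuals(model, type = "pearson")^2) / df.residual(model)
```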
Check Residuals: Add the residuals to your data frame (`data$residuals`) and plot them against extraversion and neuroticism. Add `geom_smooth()` and interpret the plot. Are there any non-linearities left to be modeled? What do the plots tell you about the model fit?
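A sketch, using the default deviance residuals (repeat the plot for neuroticism):

```r
data$residuals = residuals(model)  # deviance residuals by default
ggplot(data, aes(x = extraversion, y = residuals)) +
  geom_point() +
  geom_smooth()
```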
Interpret: What is the estimated change in the odds ratio if one moves between the two extreme ends of the extraversion scale? Does it depend on sex?
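A sketch of the computation, taking the scale range from the observed data:

```r
beta_ext = coef(model)["extraversion"]
range_ext = max(data$extraversion) - min(data$extraversion)
exp(beta_ext * range_ext)  # odds ratio across the full extraversion range
```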
Predict: Use the `predict` function from R. Plot the predicted values against extraversion. Q: What is the meaning of the y-scale?
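For example (note that `predict` returns values on the link scale by default):

```r
data$predicted = predict(model)
ggplot(data, aes(x = extraversion, y = predicted)) + geom_point()
```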
Convert the outputs to probabilities through the inv-logit function:
```r
invlogit = function(x) 1/(1 + exp(-x))
```
Plot it again.
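For example:

```r
ggplot(data, aes(x = extraversion, y = invlogit(predicted))) + geom_point()
```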
We can also predict what an ideal response would be if all levels of the covariates and factors were measured equally (balanced). Use the `predict(..., newdata = XXX)` command. Apply the `invlogit` function only for plotting (you can do `aes(x = XXX, y = invlogit(predict))`).
```r
df.predict = data.frame(expand.grid(neuroticism = seq(XXX), sex = c(XX, XX), extraversion = seq(XXX)))
df.predict$XXX = predict(XXX, XXX)
```
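One possible completion (the scale ranges and the level names "female"/"male" are assumptions; adapt them to your data):

```r
df.predict = expand.grid(neuroticism = seq(0, 24),
                         sex = c("female", "male"),
                         extraversion = seq(0, 23))
df.predict$predicted = predict(model, newdata = df.predict)
```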
Combine the resulting data frames using the `dplyr` package:

```r
combined = dplyr::bind_rows(a, b, c, .id = "type")
# Note: 'dplyr' is the successor to the 'plyr' package; they are partly incompatible.
# If you also use 'plyr', don't load dplyr via library(dplyr); you can access
# single functions from a package using the `package::function` syntax instead.
```
Plot extraversion against volunteering and split the data by your group-id variable (type in the previous example). Why do the two types differ?
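A hypothetical completion of the two previous steps, reusing `data`, `df.predict`, `invlogit`, and the `predicted` column from the sketches above:

```r
combined = dplyr::bind_rows(data, df.predict, .id = "type")
# Observed rows (type "1") contribute the 0/1 outcome; predicted rows
# (type "2") contribute the model prediction on the probability scale
combined$prob = ifelse(combined$type == "1",
                       combined$volunteer,
                       invlogit(combined$predicted))
ggplot(combined, aes(x = extraversion, y = prob, color = type)) +
  stat_summary(fun = mean, geom = "point")
```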
We want to show that we can extrapolate from seen data: change the range of predicted values for extraversion to 0 to 35 (even though in this example the maximal extraversion score seems to be at 23). Plot the predicted values again. This could of course also be used to interpolate between scores (i.e. predict for the non-existing score of 10.5) by reducing the step size in `seq`.
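A sketch of the extended grid (again assuming the level names "female"/"male"):

```r
df.predict = expand.grid(neuroticism = seq(0, 24),
                         sex = c("female", "male"),
                         extraversion = seq(0, 35))
# For interpolation, reduce the step size instead, e.g. seq(0, 23, by = 0.5)
```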
Fit the new model with the added interaction.
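A sketch, assuming the intended interaction is between extraversion and neuroticism:

```r
model2 = glm(volunteer ~ sex + extraversion * neuroticism,
             data = data, family = binomial)
```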
Do a model comparison of the model with interaction to the one without (use likelihood ratio or AIC/BIC). Which one explains more variance?
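For example:

```r
anova(model, model2, test = "Chisq")  # likelihood-ratio test
AIC(model, model2)
BIC(model, model2)
```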
Redo the prediction step with `df.predict` (where you used `expand.grid`), but apply it to the new model with the interaction. Make a plot of extraversion against the `predict` output, but split it by color using `aes(..., color = factor(neuroticism))`.
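A sketch, reusing `df.predict` and `invlogit` from above together with the interaction model `model2`:

```r
df.predict$predicted = predict(model2, newdata = df.predict)
ggplot(df.predict, aes(x = extraversion, y = invlogit(predicted),
                       color = factor(neuroticism))) +
  geom_line()
```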
Who is the most likely candidate to volunteer? Who is the least likely one? What is the relative risk of being very high on extraversion and very low on neuroticism against very high on extraversion and very high on neuroticism?
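A sketch of the relative-risk computation; the extreme values (0/24 for neuroticism, 23 for extraversion) and the sex level are assumptions:

```r
nd = data.frame(sex = "female",
                extraversion = c(23, 23),
                neuroticism = c(0, 24))
p = predict(model2, newdata = nd, type = "response")
p[1] / p[2]  # low vs. high neuroticism, both at high extraversion
```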
If you have time, try to visually compare your fit against the actual data. This is somewhat difficult with two continuous predictors, but maybe you find a way! I recommend a combination of non-linear smoothing functions (`geom_smooth`) and `facet_wrap`, as in the sketch below.
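One possibility, binning neuroticism into a few groups (a sketch, not the only way):

```r
ggplot(data, aes(x = extraversion, y = volunteer)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = binomial)) +
  facet_wrap(~ cut(neuroticism, 3))
```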
This concludes the logistic regression exercise. Check back to the initial goal and see whether you learned something!