---
title: "Logistic Regression"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    number_sections: true
    theme: flatly
    highlight: zenburn
    css: ../rmarkdown_hw.css
---

```{r include=FALSE}
ADDITIONALHINT = FALSE
showSolution = TRUE
```

# Introduction

## What will I learn

- Logistic regression (duh)
- Testing for main effects / interactions in GLMs
- Interpretation of odds ratios
- Predicting data ($\hat{y}$)

## The data

This data set contains real data, published by [Cowles and Davis (1987)](http://onlinelibrary.wiley.com/doi/10.1111/j.2044-8309.1987.tb00769.x/full). It is freely available on the [Internet](http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Cowles.txt). The following fields are available:

Variable            | Description
------------------- | -----------------------------------------------
subject number      | All participants are numbered consecutively
neuroticism         | Scale from the Eysenck personality inventory
extraversion        | Scale from the Eysenck personality inventory
sex                 | Female or male
volunteer           | Did the subject volunteer for a study?

## The research question

What does the willingness to participate in a study depend on?

# Analysis

1. Load the data.

```{r,warning=FALSE, message=FALSE}
library(ggplot2)
library(dplyr)
data = read.csv('cowles.csv')
```

1. Investigate the structure of the data using *summary(data)*.

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: What kind of data ranges do you have? Do you have continuous variables? What are your dependent/independent variables?
")
```

```{r echo=showSolution,eval=showSolution}
summary(data)
```

1. *Clean*: Change the response of *volunteer* from "yes"/"no" to 1/0. This makes plotting easier. Note that there is a trade-off: sometimes we want binary data to be treated as categorical/factor data, sometimes we want numerical 0/1 data. Two possibilities: check whether the variable is equal to 'yes' (to convert it to logical), then convert to numerical by adding 0.
Or: cast it to numeric and subtract one.

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: Either use as.numeric(XX)-1 or (XX=='yes')+0
")
```

```{r echo=showSolution,eval=showSolution}
data$volunteer = (data$volunteer=='yes')+0
```

1. *Vis*: Make plots using ggplot2: *neuroticism* vs. *volunteer*, *extraversion* vs. *volunteer*, *sex* vs. *volunteer*. Make use of stat_summary().

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: ggplot(data,aes(x=??,y=??))+stat_summary()
")
```

```{r echo=showSolution,eval=showSolution}
# Solutions
#data = data[,!(colnames(data) %in% 'X')] # only needed if you reload the very original data

# Main effects
ggplot(data,aes(x=sex,y=volunteer))+
  stat_summary()
ggplot(data,aes(x=extraversion,y=volunteer))+
  stat_summary()
ggplot(data,aes(x=neuroticism,y=volunteer))+
  stat_summary()

# Two-way interactions with sex
ggplot(data,aes(x=extraversion,y=volunteer,color=sex))+
  stat_summary()
ggplot(data,aes(x=neuroticism,y=volunteer,color=sex))+
  stat_summary()
```

1. *Fit*: Fit a logistic model using *glm* with main effects only: $\text{logit}(\hat{y}) = \beta_0 + \beta_1*male + \beta_2*extraversion + \beta_3*neuroticism$. Remember to change the family argument of the glm function accordingly.

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: ?? = glm(data=??,formula=??,family=??)
")
```

```{r echo=showSolution,eval=showSolution}
mres = glm(family=binomial,data=data,formula=volunteer~extraversion+neuroticism+sex)
```

1. *Check dispersion parameter*: Use the formula from the slides to check whether the variance follows the mean as expected.

```{r echo=showSolution,eval=showSolution}
sum(resid(mres, type = "pearson")^2) / df.residual(mres) # should be close to 1
```

1. *Check residuals*: Add the residuals to your data frame (data$residuals) and plot them against extraversion and neuroticism. Add geom_smooth and interpret the plot. Are there any non-linearities left to be modeled? What do the plots tell you about the model fit?
```{r echo=showSolution,eval=showSolution}
data$resid = residuals(mres)
ggplot(data,aes(x=extraversion,y=resid))+
  geom_point()+
  geom_smooth()
```

1. *Interpret*: What is the estimated change in the odds ratio when moving between the two extreme ends of the extraversion scale? Does it depend on which sex you have?

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: Use min/max to find the range of extraversion. Think of the odds-ratio change for one step in X; which parameter represents that?
")
```

```{r echo=showSolution,eval=showSolution}
ra = diff(range(data$extraversion))
print(ra)
coeffs = broom::tidy(mres)
print(coeffs)
exp(coeffs$estimate[coeffs$term=="extraversion"]*ra)
"No, it does not depend on sex, because we have not modeled an interaction yet. The relative risks would depend, though!"
# logit <=> log(odds)
"logit(y(extra=2))  = b0 + b1*2  + b2*is_male + b3*neuro"
"logit(y(extra=23)) = b0 + b1*23 + b2*is_male + b3*neuro"
"difference: (b0 + b1*23 + b2*is_male + b3*neuro) - (b0 + b1*2 + b2*is_male + b3*neuro)"
"= b1*23 - b1*2 = b1*21"
# odds ratio of the extremes: exp(b1*21)
```

1. *Interpret/Vis*: Plot actual data and predicted data into one plot. Use the predict function from R.
    + Start with *predict(model)*, which will give you $\hat{y}$. In what range are the outputs of predict?
    + Plot the predicted values against extraversion. ***Q:*** What is the meaning of the y-scale?

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: It is useful to add the predicted values to your data frame, i.e. df$predicted = predict(model).
This makes it easier to work with ggplot: ggplot(data=??,aes(x=??,y=??))+geom_point()
")
```

```{r echo=showSolution,eval=showSolution}
data$predict = predict(mres)
ggplot(data=data,aes(x=extraversion,y=predict))+
  geom_point()
# Add some color to show the main effect of sex
ggplot(data=data,aes(x=extraversion,y=predict,colour=sex))+
  geom_point()
```

+ Convert the outputs to probabilities through the inverse-logit function:

```{r,eval=TRUE}
invlogit = function(x) 1/(1 + exp(-x))
# Equivalent formulation:
invlogit = function(x) exp(x) / (1+exp(x))
```

```{r echo=showSolution,eval=showSolution}
data$predictProb = invlogit(data$predict)
# Alternative: data$predictProb = predict(mres,type="response")
```

+ Plot it again.

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: As before, add a new column to your data.frame, then use ggplot to visualize it.
")
```

```{r echo=showSolution,eval=showSolution}
ggplot(data=data,aes(x=extraversion,y=predictProb,colour=sex))+
  geom_point()
```

+ We can also predict what an ideal response would be, if all levels of the covariates and factors were measured equally (balanced). Use the predict(..., newdata=XXX) command. Apply the invlogit function only for plotting (you can do aes(x=XXX,y=invlogit(predict))).

```{r,eval=F}
df.predict = data.frame(expand.grid(neuroticism = seq(XXX),sex=c(XX,XX),extraversion=seq(XXX)))
df.predict$XXX = predict(XXX,XXX)
```

```{r echo=showSolution,eval=showSolution}
df.predict = data.frame(expand.grid(neuroticism = seq(0,24),sex=c('male','female'),extraversion=seq(2,23)))
df.predict$predict = predict(mres,newdata = df.predict)
ggplot(df.predict,aes(x=extraversion,y=invlogit(predict)))+
  geom_point()
```

- Combine the predicted data frame with your original data frame. A convenient way is to use a function from the dplyr package:

```{r,eval=F}
combined = dplyr::bind_rows(a,b,c,.id="type")
# Note: 'dplyr' is the successor to the 'plyr' package; they are partly incompatible. Don't load dplyr using "library(dplyr)".
# You can access single functions from a package using the package::function syntax.
```

```{r echo=showSolution,eval=showSolution}
combined = dplyr::bind_rows(data,df.predict,.id="type")
```

- Plot extraversion against volunteering and split the data by your group-id variable (*type* in the previous example). Why do the two types differ?

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: ggplot(data=XXX,aes(x=XX,y=XX,group=type))+stat_summary()
")
```

```{r echo=showSolution,eval=showSolution}
ggplot(data=combined,aes(x=extraversion,y=invlogit(predict),group=type,color=type))+
  stat_summary()
```

- We want to show that we can extrapolate from seen data: change the range of predicted values for extraversion to 0 to 35 (even though in this data set the maximal extraversion score seems to be 23). Plot the predicted values again. The same approach could also be used to interpolate between scores (i.e. predict the non-existing score of 10.5) by reducing the step size in seq.

```{r echo=showSolution,eval=showSolution}
df.predict2 = data.frame(expand.grid(neuroticism = seq(0,24),sex=c('male','female'),extraversion=seq(0,35)))
df.predict2$predict = predict(mres,newdata = df.predict2)
ggplot(df.predict2,aes(x=extraversion,y=invlogit(predict)))+
  geom_point()
```

# Interactions

1. We will look at the interaction between neuroticism and extraversion: $\text{logit}(\hat{y}) = \beta_0 + \beta_1*male + \beta_2*extraversion + \beta_3*neuroticism + \beta_4*extraversion*neuroticism$

- Fit the new model with the added interaction.

```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: A+B+A:B == A*B. A:B is the interaction.
")
```

```{r echo=showSolution,eval=showSolution}
mres2 = glm(family=binomial,data=data,formula=volunteer~extraversion*neuroticism+sex)
```

- Do a model comparison of the model with the interaction against the one without (use a likelihood-ratio test or AIC/BIC). Which one explains more variance?
```{r echo=FALSE,eval=ADDITIONALHINT,results='asis'}
cat("
*Hint*: anova(XX,test=XX)
")
```

```{r echo=showSolution,eval=showSolution}
anova(mres,mres2,test="LRT")
```

2. We want to visualize the interaction.

- Use the previous df.predict where you used expand.grid, but apply it to the new model with the interaction.
- Make a plot of extraversion against predict, but split it by color using aes(..., color=factor(neuroticism)).

```{r echo=showSolution,eval=showSolution}
df.predict$predictFull = predict(mres2,newdata=df.predict)
ggplot(df.predict,aes(x=extraversion, y=invlogit(predictFull), color=factor(neuroticism)))+
  geom_point()
```

- Who is the most likely candidate to volunteer? Who is the least likely one? What is the *relative risk* of being very high on extraversion and very low on neuroticism, compared to very high on extraversion and very high on neuroticism?
- If you have time, try to visually compare your fit against the actual data. This is somewhat difficult with two continuous predictors, but maybe you find a way! I recommend a combination of the non-linear smoothing function geom_smooth and facet_wrap.

```{r echo=showSolution,eval=showSolution}
ggplot(df.predict,aes(x=extraversion, y=invlogit(predictFull)))+
  geom_point()+
  geom_path()+
  geom_point(data=data,aes(y=volunteer),color='red')+
  geom_smooth(data=data,aes(y=volunteer),color='red')+
  facet_wrap(~neuroticism)+
  scale_y_continuous(limits=c(0,1))
```

This concludes the logistic regression exercise. Check back to the initial goals and see whether you learned something!
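As an appendix: the relative-risk question above has no solution chunk. One possible sketch, assuming the interaction model *mres2* and the *invlogit* function defined earlier, and arbitrarily fixing sex to 'female' (the relative risk, unlike the odds ratio, is the ratio of the two predicted probabilities):

```{r echo=showSolution,eval=showSolution}
# Sketch: relative risk of volunteering at high extraversion,
# comparing low vs. high neuroticism. Extreme values and sex='female'
# are illustrative choices, not part of the original exercise.
p.lowN  = invlogit(predict(mres2, newdata=data.frame(extraversion=23, neuroticism=0,  sex='female')))
p.highN = invlogit(predict(mres2, newdata=data.frame(extraversion=23, neuroticism=24, sex='female')))
p.lowN / p.highN # relative risk
```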