Slides & my Presentation style

Click this link to add notes to the slides

this adds ?showNotes=true to the file

You can download the pdf or print them yourself:

click this link for pdf printable (use google chrome)

click this link for pdf printable + notes

HTML is best viewed in chrome

Organization

09.00 - 10.00 Lecture I
10.00 - 10.30 Q/A + break
10.30 - 11.30 Lecture II
11.30 - 12.00 Q/A
12.00 - 13.00 Mensa
13.00 - ~14.00 Lecture III / recap of yesterday’s exercises
~ 14.30 - ~18.00 Exercises in R and we offer help doing the exercises @ Gebäude 50!

Exam

Too many people for oral exam

=> Exam in 3-4 Weeks (date to be determined)

Scope: Content of course. Exercises are just to enhance your understanding (they are for you! Come and enjoy :-)).

No need to write R-Code in the course, you should be able to navigate R-Output (e.g. model output).

Athena: https://athena.cogsci.uos.de/

Please fill in the feedback after the second session

Course Overview

Day 1: There is just one statistical test, linear regression, multiple regression, 2x2 design
Day 2: Inference, Assumptions, Philosophy, t-test+ANOVA+ANCOVA
Day 3: Logistic Regression, GLM
Day 4: Mixed Models
Day 5: Leftovers

Course Overview

images

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Purpose of statistics

Organization

Description

Inference

Why statistics in the first place?

Because of Variation

sources of variation

source-youtube-video of image

There are many answers, variation is just one motivation. - natural : The variation that is natural to the sample, e.g. apples have a distribution of redness, not a single value - sampling error: The variation we get when we take a subset of a population. I.e. two baskets of red apples have different sample distributions of redness - explainable: The variation we can explain by e.g. knowing which apple belongs to which variety of apples - non-sampling error: The variation we get e.g. non-randomly samplig apple trees (get all apples from one tree, but the orchard is actually mixed variety - we are missing out on other trees)

“The idea of an infinite population distributed in a frequency distribution in respect of one or more characters is fundamental to all statistical work. From a limited experience., we may obtain some idea of the infinite hypothetical population from which our sample is drawn, and so of the probable nature of future samples to which our conclusions are to be applied. If a second sample belies this expectation we infer that it is. drawn from a different population; that the treatment to which the second sample. had been exposed did in fact make a material difference, or. had materially altered. Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first” (Fisher, 19542, p.41).

Most important lesson

Always visualize your data

…make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding. F. J. Anscombe, 1973

dino

These data all have the same mean + variance + correlation!

Overview

Warmup
Why statistics
do it yourself statistics
- three important concepts
- there is only one statistical test
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Three extremely important concepts

Population
Sample
Sampling Distribution

Population

This could be the height of the population

of all people in this room.

This could be the height of the population of all people in the world. The upper example is a sub-population.

Sample

Height measured of 12 different samples (=experiment) of a population.

Sampling Distribution

sampling distribution

samplingdistirbution

Central limit theorem

Overview

Warmup
Why statistics
do it yourself statistics
- three important concepts
- there is only one statistical test
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

What are statistical inferences?

inferece

Ingredients for every statistical test

Q:Did our statistic$^{*}$ occur due to chance$^{**}$?

A: To find out, look at the distribution of the statistic at the chance level!

$^{*}$ could be any measure (statistic), usually: mean, median, variance, t-statistic etc.

$^{**}$ chance-level is defined by the $H_0$ (Null Hypothesis), most often it is mean = 0

Example $n_{obs} = 10$

##  [1] -0.37645381  0.43364332 -0.58562861  1.84528080  0.57950777
##  [6] -0.57046838  0.73742905  0.98832471  0.82578135 -0.05538839

## [1] "mean:0.38, sd:0.78"

This is a simple sample from a normal distribution (dobs = rnorm(10) + 0.25 [that is a random number from a standard normal {mean=0,sd=1} plus the 0.25, in endeffect this means we sample from a normal with mean=0.25 and sd=1]). We want to know, how likely it is to see such data, assuming the POPULATION has a mean==0 (it is very obvious that the SAMPLE has a mean != 0, because the mean is 0.38)

What we need for our example:

test-statistic:
- Let’s take the mean value for simplicity (mean(experimental effect) = $\hat \mu$ = 0.38)

Chance-Model (commonly called: model of $H_0$) - could (in principle) be anything. Let’s assume a normal distribution with a mean of 0 and standard deviation ($\sigma) = 1$.

The $H_0$ model

Take 1000 random samples with $n = n_{obs} = 10$ each

dnull =replicate(1000, rnorm(10) ) %>%reshape2::melt()
null_stat = dnull%>%group_by(Var2)%>%summarise(m = mean(value)) # var2 codes the sample-ID
qplot(null_stat$m,bins=50)

Q: What is this distribution called?

The p-value

Because standardization makes things comparable!

sprintf('p-value: %.3f',sum(null_stat$m >=mean(dobs)) / 1000)

## [1] "p-value: 0.121"

The p-value is the probability to find such a statistic or larger, assumed the $H_0$ is true. Why larger? First of all, probabilities are only defined in areas. The probability to get a mean of exactly 1.23 is exactly 0. Thus why exactly this region? Here you can find an explanation based on Type-I errors (we will get to know later), and Here you find a original citation from Fisher

Note that here we took a one-sided test. We put the asumption in, that we know already on which side to look. Usually we do not know this. As a consequence we need to take: sum(abs(null_stat$m) >=mean(dobs) / 1000, that is, we map the negative values to the positive side. In effect this is the same as looking at both tails of the distribution at the same time. This will usually make your p-value bigger (i.e. $H_0$ more likely).

Analytic shortcut

Simulation can be expensive (e.g. in very complex models => day 4) and computers are needed.

For normal distributed populations (assumption!) the sampling distribution can be easily calculated (a normal with mean = $mean_{H_0}$, $SE = \frac{\sigma_{H_0}}{\sqrt N}$, Standard-error)

Summary

there is only one test

Overview

Warmup
Why statistics
do it yourself statistics
Introducing the example
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Should the seatbelt law be withdrawn?

british parliament

ggplot(data,aes(x=law,y=DriversKilled))+stat_summary()

There could be confounds!

ggplot(data,aes(x=date,y=PetrolPrice,color=factor(law)))+geom_point()

We will ignore the dependency in time. Please never do this in your data - this is an example only! You have been warned

Preview: Multiple regression to the rescue

$DriversKilled = Average+ kmDrivenEffect + TotalDriverEffect +$ $PetrolPriceEffect + SeatbeltLawEffect$

$DriversKilled = \beta_{average} + \beta_1 \cdot kmsDriven + \beta_2 \cdot TotalDrivers +$ $\beta_3 \cdot PetrolPrice + \beta_4 \cdot SeatbeltLaw + e_i$

Linear Models allow to structure and separate effects
Linear Models allow to predict values (e.g. what if PetrolPrice would be double the size?)
Linear Models can be easily extended for non-linear effects (e.g. GAM), complicated distributions(e.g. Yes/No aka logistic regression) or multiple outcomes (General Linear Model)

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

The simple linear regression

fit = lm(DriversKilled~1+kms,data)

The lm function is extremely important in R. Helpful other functions that work nicely are: predict, confint, residuals

Lines are simple! Intercept + Slope

\[ DriversKilled = \beta_0 + (kms/1000) * \beta_1 \]
Red Line: $[DriversKilled | kms/1000==20] - [DriversKilled | kms/1000==10]$

$=\beta_0 + 20\beta_1 - (\beta_0 + 10\beta_1)$ $=10\beta_1$

Which line to choose?

Residuals / L2 Norm

Minimizing the residuals maximizes the fit

L2-Norm (Least Squares): $min(|residual_i|_2)$

In our example

Data: $y = \beta_0 + x_1 \beta_1 + e_i$

Prediction: $\hat{y} = \beta_0 + x_1 \beta_1$

Residuals: $e_i = y - \hat{y}$

in R

summary(lm(formula = DriversKilled ~ 1 + kms,data=data))

## 
## Call:
## lm(formula = DriversKilled ~ 1 + kms, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.028 -19.021  -1.974  16.719  66.964 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.644e+02  9.067e+00  18.130  < 2e-16 ***
## kms         -2.774e-03  5.935e-04  -4.674  5.6e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.1 on 190 degrees of freedom
## Multiple R-squared:  0.1031, Adjusted R-squared:  0.09839 
## F-statistic: 21.84 on 1 and 190 DF,  p-value: 5.596e-06

Explained Variance

(I modified the example data for clarity)

$var(y)$ vs. $var(e)$ $R^2 = 1 - \frac{var(e)}{var(y)}$

Single outlier

Especially true for small sample sizes

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

summary(lm(DriversKilled ~ kms + drivers + PetrolPrice,data = data))

## 
## Call:
## lm(formula = DriversKilled ~ kms + drivers + PetrolPrice, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.333  -7.848  -0.532   7.544  35.036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.642e+01  1.254e+01  -2.107   0.0365 *  
## kms          7.901e-04  3.255e-04   2.428   0.0161 *  
## drivers      8.164e-02  3.429e-03  23.810   <2e-16 ***
## PetrolPrice  9.716e+00  7.911e+01   0.123   0.9024    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.53 on 188 degrees of freedom
## Multiple R-squared:  0.7969, Adjusted R-squared:  0.7937 
## F-statistic: 245.9 on 3 and 188 DF,  p-value: < 2.2e-16

We can include multiple predictors in our model. The interpretation is the same (or very similar) as with a single one. If you look at the output R gives you, the important fields are below “Coefficients”. Especially the column with “Estimate” is important. Std Error (means Standarderror) is also important, it gives you the estimate of the width of the sampling distribution. If this width is very large (compared to your estimate), you are quite unsure about your estimate, if it is small you are more sure. This is directly reflected in the ratio of the two, $\frac{estimate}{Std. Error}$ which is also called the t-value. Conveniently this is the next column. If we remember that there is only a single statistical test, one can now calculate a p-value for this t-value. It reflects the probability to see such a ratio of mean/SE assumed that the mean of the population is actually 0 (the $H_0$). Important: The last step is ONLY valid, if you ALSO assume that the data-generating mechanism (the population) uses normally-distributed errors (that is, the residuals are a sample from a normal distribution).

Interpretation: The intercept reflects the value when all other predictors are 0 (i.e. how many drivers will die when 0km are drivven by 0 drivers with a PetrolPrice of 0). It should be immediately clear that the intercept does not necessarily reflect a useful value. The nice thing about such an additive model is that each effect estimate can be interpreted without the other ones (also called additive effects)

With two predictors you span a plane

Overlapping variance

venn diagram of variance explained

Perfect redundant

data_redundant = data
data_redundant$meters = data_redundant$kms*1000
summary(lm(DriversKilled ~ meters + kms,data = data_redundant))

## 
## Call:
## lm(formula = DriversKilled ~ meters + kms, data = data_redundant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.028 -19.021  -1.974  16.719  66.964 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.644e+02  9.067e+00  18.130  < 2e-16 ***
## meters      -2.774e-06  5.935e-07  -4.674  5.6e-06 ***
## kms                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.1 on 190 degrees of freedom
## Multiple R-squared:  0.1031, Adjusted R-squared:  0.09839 
## F-statistic: 21.84 on 1 and 190 DF,  p-value: 5.596e-06

Partially redundant

data_redundant = data
data_redundant$driversnoise = data_redundant$drivers+rnorm(nrow(data_redundant),sd=10)
coef(summary(lm(DriversKilled ~ drivers,data = data_redundant)))

##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) -7.30112028 4.938443950 -1.478425 1.409499e-01
## drivers      0.07789178 0.002913364 26.736024 2.636268e-66

coef(summary(lm(DriversKilled ~ drivers+driversnoise,data = data_redundant)))

##                 Estimate Std. Error    t value  Pr(>|t|)
## (Intercept)  -7.31191485 4.95029664 -1.4770660 0.1413223
## drivers       0.04903418 0.09120555  0.5376228 0.5914700
## driversnoise  0.02885613 0.09115416  0.3165641 0.7519237

Multicollinearity is usually not so obvious

On this slide the important bits are the standard errors (SE) of the estimates. The upper model without a redundant predictor, the lower one with a redundant predictor. The top most code bit shows you how the redundancy wa introduced.

Not all statistic programs are as wellbehaved as R. Some might throw an error. In any case, if there is high redundancy, SE calculation can be off quite a bit.

multicollinearity / redundancy will be usually way less obvious. E.g. it could be, that a certain combination of your predictors together explain the same variance as a certain other combination. Thus, just with pairwise correlations between predictors you cannot find out whether you have collinearity. The variance influence factor might help you to check this. If you worry about redundancy/collinearity you should should follow that procedure.

Surpressors: Including unrelated variables can improve modelfit

## 
## Call: lm(formula = Y ~ OV + S)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01890    0.01838  -1.028    0.308    
## OV           0.99939    0.01798  55.574   <1e-04 ***
## S            0.99859    0.01807  55.252   <1e-04 ***
## 
## Residual standard error: 0.1411 on 57 degrees of freedom
## Multiple R-squared:  0.9819, Adjusted R-squared:  0.9812 
## F-statistic:  1544 on 2 and 57 DF,  p-value: < 1e-04

## 
## Call: lm(formula = Y ~ OV)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04080    0.13452  -0.303    0.763
## OV           0.01284    0.01565   0.820    0.415
## 
## Residual standard error: 1.033 on 58 degrees of freedom
## Multiple R-squared:  0.01147,    Adjusted R-squared:  -0.005571 
## F-statistic: 0.6731 on 1 and 58 DF,  p-value: 0.4153

a graphical intuition

surpressor animation

A suppressor (S) is a independent variable, that is not correlated with the Y per se, but changes the estimate of other predictors (OV). Note how the coefficients of OV are completly changed by adding another variable. The animation visualizes why this can be the case. Imagine you have a true relationship between X and Y. But then a little devil adds noise to the X variable. The relationship vanishes (this is what is animated). But if you know what kind of noise was added to X (in this case S), and include it in your model, the information of the true relationship can be recovered. If you want to understand why exactly have a look at this nice stackexchange thread

The notion of suppression is also present in descriptions of confounding. The classic example of a suppressor is a confounding hypothesis described by Horst (1941) involving the prediction of pilot performance from measures of mechanical and verbal ability. When the verbal ability predictor was added to the regression of pilot performance on mechanical ability, the effect of mechanical ability increased. This increase in the magnitude of the effect of mechanical ability on pilot performance occurred because verbal ability explained variability in mechanical ability; that is, the test of mechanical ability required verbal skills to read the test directions. taken from MacKinnon

A shorthand notation

$y_i = 1*\beta_0 + x_{i,1}*\beta_1 + x_{i,2}*\beta_2 + ... + x_{i,n}*\beta_n + e_i$

is nothing else than matrix multiplication

$y = X\beta+e$

$X = designmatrix$ with size $[n_{datapoints},n_{predictors}]$

$e = residuals$with size $[n_{datapoints}]$

Designmatrix & algebra

##     (Intercept)   kms drivers PetrolPrice
## 1             1  9059    1687  0.10297181
## 2             1  7685    1508  0.10236300
## 3             1  9963    1507  0.10206249
## 4             1 10955    1385  0.10087330
## 5             1 11823    1632  0.10101967
## 6             1 12391    1511  0.10058119
## 7             1 13460    1559  0.10377398
## 8             1 14055    1630  0.10407640
## 9             1 12106    1579  0.10377398
## 10            1 11372    1653  0.10302640
## 11            1  9834    2152  0.10273011
## 12            1  9267    2148  0.10199719
## 13            1  9130    1752  0.10127456
## 14            1  8933    1765  0.10070398
## 15            1 11000    1717  0.10013961
## 16            1 10733    1558  0.09862110
## 17            1 12912    1575  0.09834929
## 18            1 12926    1520  0.09808018
## 19            1 13990    1805  0.09727921
## 20            1 14926    1800  0.09741062
## 21            1 12900    1719  0.09742524
## 22            1 12034    2008  0.09638063
## 23            1 10643    2242  0.09573896
## 24            1 10742    2478  0.09510631
## 25            1 10266    2030  0.09673597
## 26            1 10281    1655  0.09610922
## 27            1 11527    1693  0.09536725
## 28            1 12281    1623  0.09470959
## 29            1 13587    1805  0.09411762
## 30            1 13049    1746  0.09353215
## 31            1 16055    1795  0.09295405
## 32            1 15220    1926  0.09283979
## 33            1 13824    1619  0.09272474
## 34            1 12729    1992  0.09226965
## 35            1 11467    2233  0.09170669
## 36            1 11351    2192  0.09126207
## 37            1 10803    2080  0.09071160
## 38            1 10548    1768  0.09027633
## 39            1 12368    1835  0.08995192
## 40            1 13311    1569  0.08909964
## 41            1 13885    1976  0.08867919
## 42            1 14088    1853  0.08815929
## 43            1 16932    1965  0.08890206
## 44            1 16164    1689  0.08818133
## 45            1 14883    1778  0.08894029
## 46            1 13532    1976  0.08772661
## 47            1 12220    2397  0.08742885
## 48            1 12025    2654  0.08703543
## 49            1 11692    2097  0.08644992
## 50            1 11081    1963  0.08587264
## 51            1 13745    1677  0.08539822
## 52            1 14382    1941  0.08382198
## 53            1 14391    2003  0.08459078
## 54            1 15597    1813  0.08413690
## 55            1 16834    2012  0.08377841
## 56            1 17282    1912  0.08351074
## 57            1 15779    2084  0.08280639
## 58            1 13946    2080  0.08117889
## 59            1 12701    2118  0.08285361
## 60            1 10431    2150  0.09419012
## 61            1 11616    1608  0.09239984
## 62            1 10808    1503  0.10816148
## 63            1 12421    1548  0.10721169
## 64            1 13605    1382  0.11404297
## 65            1 14455    1731  0.11245412
## 66            1 15019    1798  0.11131625
## 67            1 15662    1779  0.11030125
## 68            1 16745    1887  0.10819718
## 69            1 14717    2004  0.10702744
## 70            1 13756    2077  0.10494698
## 71            1 12531    2092  0.11935775
## 72            1 12568    2051  0.11762190
## 73            1 11249    1577  0.13302742
## 74            1 11096    1356  0.13084524
## 75            1 12637    1652  0.12831848
## 76            1 13018    1382  0.12354745
## 77            1 15005    1519  0.11858681
## 78            1 15235    1421  0.11633748
## 79            1 15552    1442  0.11516148
## 80            1 16905    1543  0.11450120
## 81            1 14776    1656  0.11352298
## 82            1 14104    1561  0.11193018
## 83            1 12854    1905  0.11061053
## 84            1 12956    2199  0.11527439
## 85            1 12177    1473  0.11379349
## 86            1 11918    1655  0.11234958
## 87            1 13517    1407  0.11175347
## 88            1 14417    1395  0.10964252
## 89            1 15911    1530  0.10844090
## 90            1 15589    1309  0.10788494
## 91            1 16543    1526  0.10908477
## 92            1 17925    1327  0.10757145
## 93            1 15406    1627  0.10616402
## 94            1 14601    1748  0.10630000
## 95            1 13107    1958  0.10482531
## 96            1 12268    2274  0.10345175
## 97            1 11972    1648  0.10144992
## 98            1 12028    1401  0.10040232
## 99            1 14033    1411  0.09886203
## 100           1 14244    1403  0.10249615
## 101           1 15287    1394  0.10302743
## 102           1 16954    1520  0.10217891
## 103           1 17361    1528  0.09983664
## 104           1 17694    1643  0.09263669
## 105           1 16222    1515  0.09181496
## 106           1 14969    1685  0.09072430
## 107           1 13624    2000  0.09002121
## 108           1 13842    2215  0.08933071
## 109           1 12387    1956  0.08844273
## 110           1 11608    1462  0.08835257
## 111           1 15021    1563  0.08675736
## 112           1 14834    1459  0.08499524
## 113           1 16565    1446  0.08456794
## 114           1 16882    1622  0.08443190
## 115           1 18012    1657  0.08435088
## 116           1 18855    1638  0.08360098
## 117           1 17243    1643  0.08341726
## 118           1 16045    1683  0.08274514
## 119           1 14745    2050  0.08523527
## 120           1 13726    2262  0.08477030
## 121           1 11196    1813  0.08445892
## 122           1 12105    1445  0.08535212
## 123           1 14723    1762  0.08755921
## 124           1 15582    1461  0.09038292
## 125           1 16863    1556  0.09078329
## 126           1 16758    1431  0.10874278
## 127           1 17434    1427  0.11414223
## 128           1 18359    1554  0.11299293
## 129           1 17189    1645  0.11132071
## 130           1 16909    1653  0.10912623
## 131           1 15380    2016  0.10769846
## 132           1 15161    2207  0.10760157
## 133           1 14027    1665  0.10377502
## 134           1 14478    1361  0.10711417
## 135           1 16155    1506  0.10737477
## 136           1 16585    1360  0.11169537
## 137           1 18117    1453  0.11063818
## 138           1 17552    1522  0.11185521
## 139           1 18299    1460  0.10974234
## 140           1 19361    1552  0.10819393
## 141           1 17924    1548  0.10625536
## 142           1 17872    1827  0.10419303
## 143           1 16058    1737  0.10193397
## 144           1 15746    1941  0.10279382
## 145           1 15226    1474  0.10476034
## 146           1 14932    1458  0.10400254
## 147           1 16846    1542  0.11665552
## 148           1 16854    1404  0.11516148
## 149           1 18146    1522  0.11298954
## 150           1 17559    1385  0.11386064
## 151           1 18655    1641  0.11911808
## 152           1 19453    1510  0.12448999
## 153           1 17923    1681  0.12322295
## 154           1 17915    1938  0.12067793
## 155           1 16496    1868  0.12104898
## 156           1 13544    1726  0.11696857
## 157           1 13601    1456  0.11275026
## 158           1 15667    1445  0.10807931
## 159           1 17358    1456  0.10883852
## 160           1 18112    1365  0.11129177
## 161           1 18581    1487  0.11130401
## 162           1 18759    1558  0.11545436
## 163           1 20668    1488  0.11476830
## 164           1 21040    1684  0.11720743
## 165           1 18993    1594  0.11907640
## 166           1 18668    1850  0.11796586
## 167           1 16768    1998  0.11744913
## 168           1 16551    2079  0.11698846
## 169           1 16231    1494  0.11261054
## 170           1 15511    1057  0.11365702
## 171           1 18308    1218  0.11314445
## 172           1 17793    1168  0.11849553
## 173           1 19205    1236  0.11796940
## 174           1 19162    1076  0.11768661
## 175           1 20997    1174  0.12005924
## 176           1 20705    1139  0.11943775
## 177           1 18759    1427  0.11888127
## 178           1 19240    1487  0.11846236
## 179           1 17504    1483  0.11801660
## 180           1 16591    1513  0.11770662
## 181           1 16224    1357  0.11777609
## 182           1 16670    1165  0.11479699
## 183           1 18539    1282  0.11573525
## 184           1 19759    1110  0.11535626
## 185           1 19584    1297  0.11481536
## 186           1 19976    1185  0.11477748
## 187           1 21486    1222  0.11493598
## 188           1 21626    1284  0.11479699
## 189           1 20195    1444  0.11409316
## 190           1 19928    1575  0.11646552
## 191           1 18564    1737  0.11602611
## 192           1 18149    1763  0.11606673

\[ y = X\beta\] Can be solved by: \[ \beta = (X^TX)^{-1}X^Ty\] (don’t memorize this)

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Dummy Coding

We code the categorical variable with 0 / 1

$y_i = \beta_0 + is\_law *\beta_1 + e_i$

Interpretation

$y_i = \beta_0 + is\_law *\beta_1 + e_i$

$\beta_0$ / Intercept: Driverskilled when $is\_law$ equals 0
$\beta_1$ / Slope: Additional drivers killed when $is\_law$ was changed to 1

summary(lm(data=data_factor,DriversKilled~1+law))

## 
## Call:
## lm(formula = DriversKilled ~ 1 + law, data = data_factor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.870 -17.870  -5.565  14.130  72.130 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           125.870      1.849  68.082  < 2e-16 ***
## lawwith seatbelt law  -25.609      5.342  -4.794 3.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.03 on 190 degrees of freedom
## Multiple R-squared:  0.1079, Adjusted R-squared:  0.1032 
## F-statistic: 22.98 on 1 and 190 DF,  p-value: 3.288e-06

Effect Coding aka sum coding aka deviation coding

Using effect coding, the interpretation of the intercept changes to the mean between groups

warning: often effect coding is made using -1 and 1, then the $\beta_1$ is halve the size! using 0.5 is much more practical

Multiple levels

(Intercept)	factormiddle	factorhigh
1	0	0
1	0	0
1	0	0
1	1	0
1	1	0
1	1	0
1	0	1
1	0	1
1	0	1

Treatment / Dummy Coding

(Intercept)	factor1	factor2
1	0.5	0.0
1	0.5	0.0
1	0.5	0.0
1	0.0	0.5
1	0.0	0.5
1	0.0	0.5
1	-0.5	-0.5
1	-0.5	-0.5
1	-0.5	-0.5

Effect Coding

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

main effect errorbars/data-points ommited for clarity

Interaction effects

interaction effects

not really helpful to learn these names

How are interactions coded

$y = \beta_0 + factorA * \beta_1 + factorB * \beta_2 + factorA * factorB * \beta_3$

Different words for the same thing:

The interaction shows how much the prediction needs to be changed, when both factors coocur
The interaction is the multiplication of the columns in X of factorA and factorB
The interaction represents the difference of differences

Difference of differences: main effect (=$\beta_1$ or $\beta_2$) are differences, the interaction represents the effect of having both differences against not having them. The difference of differences.

In the model they are coded as multiplications of the main effects (this holds true for dummy and effect coding!). In dummy coding it might be the most intuitive. We have a column with only 1’s (intercept) 1 1 1 1, we have one colum for factor A e.g. 1 1 0 0, and a third one for factorB 1 0 1 0. The interaction is the effect that happens if both factorA and factorB are 1, that is only in the first row (thus the interaction is 1 0 0 0). Appreciate that this is exactly the multiplication 1 1 0 0*1 0 1 0.

Interactions of continuous/categorical predicotrs

$y = \beta_0 + factorA * \beta_1 + contB * \beta_2 + factorA * contB * \beta_3$

## 
## Call: lm(formula = hp ~ wt * am, data = d)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -17.39      52.44  -0.332  0.74260   
## wt             47.14      13.64   3.456  0.00177 **
## am           -123.33      74.03  -1.666  0.10690   
## wt:am          63.84      25.08   2.545  0.01672 * 
## 
## Residual standard error: 44.99 on 28 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.5694 
## F-statistic: 14.66 on 3 and 28 DF,  p-value: < 1e-04

If you look at the coefficient table, we now have an interaction between the slope and a categorial variable. What the coefficients mean, depends completly on the coding scheme used. Here we used dummy coding. That means, the slope for wt(weight) is 47.14 $\frac {hp}{ton}$. That is an increase of 1 ton results in ~50 more horsepower. But this is only true, if the factor àm` (automatic) is false! Because if it is true, then the dummy-coded column has 1’s in there and the interaction kicks in. Then we have 47.14 + 63.84 $\frac {hp}{ton}$. This fact, that automatic cars have a steeper increase of horsepower per weight you can check in the plot.

Note that if you want to make the strong assumption, that there is no interaction, you can leave it out of the model. If in the true population (which you sampled) there is actually no interaction (and what we see here just happened because we had a weird/lucky sample), you will have higher power - that is you will find the true effect structure more often. But if there is actually an interaction and you did not model it, you might get big problems and conclude wrong things. Plotting & calculating at the same time might help save the day! (but beware of overfitting/HARKing!)

The difference between main and simple effects

Dummy coding tests simple effects

Effect coding main effects

Effect coding to the help!

## 
## Call: lm(formula = hp ~ wt * am, data = d)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -17.39      52.44  -0.332  0.74260   
## wt             47.14      13.64   3.456  0.00177 **
## am           -123.33      74.03  -1.666  0.10690   
## wt:am          63.84      25.08   2.545  0.01672 * 
## 
## Residual standard error: 44.99 on 28 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.5694 
## F-statistic: 14.66 on 3 and 28 DF,  p-value: < 1e-04

## 
## Call: lm(formula = hp ~ wt * am, data = d)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -79.06      37.02  -2.136   0.0416 *  
## wt             79.06      12.54   6.303   <1e-04 ***
## am1          -123.33      74.03  -1.666   0.1069    
## wt:am1         63.84      25.08   2.545   0.0167 *  
## 
## Residual standard error: 44.99 on 28 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.5694 
## F-statistic: 14.66 on 3 and 28 DF,  p-value: < 1e-04

Centering continuous predictors

$weight_{new} = weight - mean(weight)$

$y = \beta_0 + weight_{new}*\beta_1 + am*\beta_2 + weight_{new}*am*\beta_3$

What does $\beta_1$ and $\beta_0$ represent now?

## 
## Call:
## lm(formula = hp ~ wt_new * am, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.976 -24.385  -3.179  14.922  94.112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   175.29      12.32  14.231 2.42e-14 ***
## wt_new         79.06      12.54   6.303 8.12e-07 ***
## am1            82.06      24.64   3.331  0.00244 ** 
## wt_new:am1     63.84      25.08   2.545  0.01672 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.99 on 28 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.5694 
## F-statistic: 14.66 on 3 and 28 DF,  p-value: 6.266e-06

Interaction between two continuous variables

## 
## Call:
## lm(formula = read ~ math * socst, data = dSoc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.6071  -4.9228  -0.7195   4.5912  21.8592 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 37.842715  14.545210   2.602  0.00998 **
## math        -0.110512   0.291634  -0.379  0.70514   
## socst       -0.220044   0.271754  -0.810  0.41908   
## math:socst   0.011281   0.005229   2.157  0.03221 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.96 on 196 degrees of freedom
## Multiple R-squared:  0.5461, Adjusted R-squared:  0.5392 
## F-statistic: 78.61 on 3 and 196 DF,  p-value: < 2.2e-16

Interaction as a plane

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
- single parameter
- model comparison
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Remember: There is only one statistical test

one statistical test We need to specify our H0

General assumption for all of the following inference: Independence of residuals

Two general ways

sampling distribution of predictors (walds t-test)

walds-t = $\frac{estimate}{SE}$
tests a single predictor
coding scheme influences what is tested (good and bad)

Model comparison / F-test

Test how well a model with predictor vs. one without explain the data
Can test sets of predictors at the same time

Generally the two tests will agree in the case of a single predictor to be tested ($(waldsT)^2 = F$).

This will be not true when we reach GLMs and in multiple regression only for high $n$

What is a standard error?

sem 9 datapoints in sample

$SEM = \frac \sigma{\sqrt{N}}$

Confidence Interval as an area of the sampling distribution

Confidence intervals visualized

Important notes on confidence intervals

An estimated mean $\hat \mu$ of a population is either in a proposed range (e.g. confidence interval) or its not.
There is not a 95% probability that the population mean $\mu$ is in a calculated 95%-confidence interval
The 95% are attached to the confidence-interval NOT the mean
=> 95% of confidence-intervals contain the population mean $\mu$
There is no way of knowing whether your confidence interval contains $\mu$ :(

Just to be sure: Mr. Mean

mrmean

The $H_0$ for the simple regression

$\mu$ - the mean:

For the intercept (usually): Intercept = 0
For the slope (usually): slope = 0

But what is the width of the sampling distribution we should use?

In other words: How does the lm function know what the standard errors ($\frac{\sigma}{sqrt(n)}$) are?

Expert slide: Estimating standard errors from X

$\beta = (X^TX)^{-1}X^Ty$

$SE = \sqrt{diag(\sigma^2 (X^TX)^{-1})}$

Quickly summarized: it looks at partial correlations of X ($X^TX$ => Covariance, inverse => partial correlations) and weights them with the residuals’ variance $\hat \sigma$

Walds t-test

summary(fit)

## 
## Call:
## lm(formula = read ~ math * socst, data = dSoc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.6071  -4.9228  -0.7195   4.5912  21.8592 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 37.842715  14.545210   2.602  0.00998 **
## math        -0.110512   0.291634  -0.379  0.70514   
## socst       -0.220044   0.271754  -0.810  0.41908   
## math:socst   0.011281   0.005229   2.157  0.03221 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.96 on 196 degrees of freedom
## Multiple R-squared:  0.5461, Adjusted R-squared:  0.5392 
## F-statistic: 78.61 on 3 and 196 DF,  p-value: < 2.2e-16

$wald-t = \frac{\beta}{SE}$

wald-t is not normal distributed but t-distributed

What is a t-distribution?

Population $\sigma$ is unknown. Sampling $\hat \sigma$ has to be used.

=> Sampling distribution (for small n) is not normal distributed but student t-distributed (with given degrees of freedom)

If we sample with small n and we need to estimate the standard deviation of the population at the same time, our sampling distribution follows a t-distribution. With high n, the t-distribution approaches the normal distribution (see how in the example df=50 and df=100 are basically identical).

Think of degrees of freedom as independent parts of information. If you have already estimated the variance ($\hat \sigma^2$) you have $n-1$ bits of information left over. If you do not have many bits of information (= datapoints) in the first place, this will “hurt”, and you will be punished a lot (=> you are more unsure about your sampling distribution). If you have lots of information, missing one does not hurt.

Fun-fact: The t-distribution is the same we use in the socalled t-test. A shortcut around the linear model. If you have only two groups, the difference $x_1 - x_2$ divided by their SD $\hat \sigma$ follows a t-distribution assumed that the population has no difference $\mu = 0$. More to that at the end of this lecture

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
- single parameter
- model comparison
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Variance Explained / F-Test

\[R^2 = 1-\frac{var(residual_{full})}{var(residual_{reduced})}\]

In order to get the variance explained of the full/reduced model we need to fit the model twice, once with the effect and once without. Conceptually, the test boils down to:

var1 = var(resid(lm(data=d,hp~1+wt)))
var0 = var(resid(lm(data=d,hp~1)))
r^2  = 1-var1/var0

Empirical distribution:

Analytic solution

An analytical form is available here as well

Usually F-distribution is used

\[ F = \frac{\frac{R^2_{m+k}-R^2_{m}}{df_1}}{\frac{1-R^2_{m+k}}{df_2}} = \frac{R^2_{m+k}-R^2_{m}}{1-R^2_{m+k}} * \frac{df_2}{df_1}\] with bookkeeping (for 1xk ANOVA):

$df_1 = k$, with $k$ the number of predictors

$df_2 = n - m - k -1$, with $n$ the number of observations, $m$ the number of total predictors

Remember that we use analytic distributions to save use the simulation cost (and people did not have computers back then)

The F-value is the ratio of the additional variance explained by including the predictor $R^2_{m+k} - R^2{m}$ divided by the residual variance, what remains after the full model explained all data ($1 -R^2_{m+k}$). In other words, its a ratio of how much you gain divided by how much you still did not explain.

The deegrees of freedom are bookkeeping, i.e. you need to adapt your $R^2$ for how many variables you used etc.

I put these plots here, so you can appreciate the shape of the F-distribution. In the left plot, we fixed the number of observation to 30+df1 and vary df2 (approx ~ the number of parameters we exclude at the same time from the model). In the right plot we fix df1 to 3 and vary df (approx ~ number of observations).

What you should see is, that the F-value for a nullmodel (the distributions you see, which is related to the ratio of variances) gets larger if you increase the number of parameters (left plot) or if you increase the number of datapoints (right).

Overfit

overfit animation

var-explained(predictorA) $<=$ Var-explained(predictorA + predictorB)

Explained-variance is not everything. We also have to take into account the number of predictors we include

Categorical predictor with multiple levels

Airbnb prices against location in Berlin (first 900 entries)

Testing

summary(lm(price~1+ngb,dair[1:900,]))

## 
## Call:
## lm(formula = price ~ 1 + ngb, data = dair[1:900, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -237.30 -134.55  -27.47   45.21 2822.60 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  222.113     22.833   9.728   <2e-16 ***
## ngbFriedrichshain-Kreuzberg  -38.717     28.353  -1.366   0.1724    
## ngbLichtenberg               -39.647     46.980  -0.844   0.3989    
## ngbMarzahn-Hellersdorf      -124.022     71.546  -1.733   0.0834 .  
## ngbMitte                      -6.562     26.621  -0.246   0.8054    
## ngbNeukÃ¶lln                 -89.534     35.416  -2.528   0.0116 *  
## ngbPankow                    -32.324     31.064  -1.041   0.2984    
## ngbReinickendorf            -166.256     88.011  -1.889   0.0592 .  
## ngbSpandau                  -112.891     78.361  -1.441   0.1500    
## ngbSteglitz-Zehlendorf        37.183     48.933   0.760   0.4475    
## ngbTempelhof-SchÃ¶neberg     -24.770     36.215  -0.684   0.4942    
## ngbTreptow-KÃ¶penick         -62.896     52.155  -1.206   0.2282    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224.9 on 888 degrees of freedom
## Multiple R-squared:  0.02032,    Adjusted R-squared:  0.008188 
## F-statistic: 1.675 on 11 and 888 DF,  p-value: 0.07433

xkcd comic significance jelly beans

Testing

m0 = lm(price~1    ,dair[1:900,])
m1 = lm(price~1+ngb,dair[1:900,])
anova(m0,m1)

## Analysis of Variance Table
## 
## Model 1: price ~ 1
## Model 2: price ~ 1 + ngb
##   Res.Df      RSS Df Sum of Sq      F  Pr(>F)  
## 1    899 45839663                              
## 2    888 44908040 11    931623 1.6747 0.07433 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

an graphical way to think about it

Coming back to a 2x2 design. Which order to test effects?

venn diagram

The first diagram shows how much variance each effect can explain. As you can see, there is some overlap (that is e.g. predictor A and B explain some identical variance). The graph needs to be seen chronological, you start in the first row. For Type I you explain everything there is to be explained by A. Then you add B and explain whatever is left, then you add the interaction. For type II, you remove all higher interactions containing this factor (here only A:B, but could be A:C or A:B:C as well). Then you compare a model with your factor (1 + A + B) , against a model without the factor (1 + B). Type III is very similar, but in addition you include the interaction, thus everything explained by A:B is also removed.

It’s quite safe to say that you should not use type I. There is a large controversy over choosing type II or III. I hope your data analytical choises do not influence your results, but if they do - state it! In any case report which type you use.

Model comparison does not care about recoding

## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -17.39      52.44  -0.332  0.74260   
## wt             47.14      13.64   3.456  0.00177 **
## am           -123.33      74.03  -1.666  0.10690   
## wt:am          63.84      25.08   2.545  0.01672 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.99 on 28 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.5694 
## F-statistic: 14.66 on 3 and 28 DF,  p-value: 6.266e-06

## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -79.06      37.02  -2.136   0.0416 *  
## wt             79.06      12.54   6.303 8.12e-07 ***
## am1          -123.33      74.03  -1.666   0.1069    
## wt:am1         63.84      25.08   2.545   0.0167 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.99 on 28 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.5694 
## F-statistic: 14.66 on 3 and 28 DF,  p-value: 6.266e-06

Walds-T test cares (simple vs. main effects)

## Analysis of Variance Table
## 
## Response: hp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## wt         1  63238   63238 31.2413 5.555e-06 ***
## am         1  12700   12700  6.2744   0.01834 *  
## wt:am      1  13111   13111  6.4774   0.01672 *  
## Residuals 28  56677    2024                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Analysis of Variance Table
## 
## Response: hp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## wt         1  63238   63238 31.2413 5.555e-06 ***
## am         1  12700   12700  6.2744   0.01834 *  
## wt:am      1  13111   13111  6.4774   0.01672 *  
## Residuals 28  56677    2024                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model comparison does not care!

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Linearity of responses
Independence
Normality
Equal variance

Linearity of responses

Linear response

Independence of residuals

Normality of residuals

Checking Assumptions: Normality using QQplots

Source: Sean Kross

There are explicit tests for normality, e.g. Lillifors or even better the Shapiro-Wilk- Test. They are problematic because they always test the H_0 that the data are normal distributed. as we already know, a test assuming H_0 cannot be used to test H_0. A different concern is, that these tests always use one specific feature to test for normality. It is quite likely that one test will give you a pass, another a fail. A common argument is, that once 30+ subjects are measured, on can invole the central limit theorem and assume that the population over subjects follows a normal distribution (because they are based on a mix of many distributions). This is a somewhat dubious argument in my interpretation. If you have large effects, these assumptions do not matter much. If data is limited, they matter a lot! I recommend reporting different kinds of robustness analysis which circumenvent some of the assumptions (e.g. test log(data), non-parametric test etc.). In the end a statistical decision (whether you believe the effect or not) will stay subjective regardless. Also bootstrapping might help a lot!

A nice trick

simulate datasets with same specifications and compare

Equal variance

homoscedasticity

homoscedasticity categorical

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

A bootstrap visualization

Why bootstrapping?

Violated assumptions
Unknown analytical + $H_0$ distribution

Bootstrap for regression

Bootstrapping in R

## 
## Call: lm(formula = read ~ math * socst, data = dSoc)
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 37.842715  14.545210   2.602  0.00998 **
## math        -0.110512   0.291634  -0.379  0.70514   
## socst       -0.220044   0.271754  -0.810  0.41908   
## math:socst   0.011281   0.005229   2.157  0.03221 * 
## 
## Residual standard error: 6.96 on 196 degrees of freedom
## Multiple R-squared:  0.5461, Adjusted R-squared:  0.5392 
## F-statistic: 78.61 on 3 and 196 DF,  p-value: < 1e-04

Continued

boot::boot.ci(bootres,type='bca',index=4) # 4 = which parameter from coef to get

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot::boot.ci(boot.out = bootres, type = "bca", index = 4)
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.0018,  0.0196 )  
## Calculations and Intervals on Original Scale

confint(fit,parm = 'math:socst')

##                   2.5 %     97.5 %
## math:socst 0.0009676529 0.02159379

We then make use of another function boot:boot.ci. This functions extracts the 95% percentile of the bootstrapped ‘samplin’ coefficient distribution (in a smart way, the bca stands for bias corrected accelerated, no space to explain here, but should be used as default). This 95% percentile corresponds to the confidence interval (because by bootstrapping we effectively build the sampling distribution)

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Fisher
Neyman-Pearson
NHST (Null Hypothesis Testing)
Bayes

Fisher

p-values are a continuous measure of evidence against the $H_0$
“Significant” p-values can be defined ad-hoc, i.e. what you think is enough evidence for this one test
a sensitivity analysis (~power analysis) can be performed but is not necessary
General idea: A single experiment should inform your decisions
for all intents and purposes, p=.049 and p=.051 constitute an identical amount of evidence against the null hypothesis source

Neyman-Pearson

p-values are only used to be threshold by $\alpha$
the pre-analysis fixed $\alpha$ denotes the error-probability to receive false positives
a power analysis is required and is part of the test
a p-value of 0.049 and 0.001 give the same evidence (with $\alpha=0.05$)
General idea: Its all about long-term properties. A single experiment does not really help

Type I vs. Type II errors

type1errors

image source

Type Errors

NHST

A dark mix between Fisher & Neyman-Pearson is currently used. This is evident by calculating a p-value, p-values below $\alpha$ are significant (NP). But exact p-values are still reported.

This is a very popular mix - but is not consistent. Either you want to accept / reject a finding (and be correct on the long run) or you want to identify how much evidence you have from your sample (and act according to how much evidence you have).

Bayes

parameter estimates (combined with the apropriate aka fully informed prior) reflect the probability of the true parameter being this value
P-Values are of no use - one can contrast two opposing hypotheses directly (e.g. model-selection)
usually not interested in testing hypotheses, but more on estimating likely ranges of parameters

An intuition on the difference between Bayes & Frequentist

Frequentist:
- All about the sampling distribution
- probabilities are frequencies of (hypothetical) outcomes
- Hypothetical Population and experiments (=> sampling distribution)
- A population-parameter has a true value
- “top down”

Bayesian:
- All about the (subjective*) probabilities of parameters
- probabilities are plausibilities/certainties
- population-parameters are uncertain
- “bottom up”

They can result in exactly the same numbers. But interpretation vastly different!

sometimes called subjective because of subjective choice of priors

Be both if you can!

Come join our Seminar: “Statistical Rethinking Reading Club” next semester

statistical rethinking book

“This is a rare and valuable book that combines readable explanations, computer code, and active learning.” —Andrew Gelman, Columbia University

“…an impressive book that I do not hesitate recommending for prospective data analysts and applied statisticians!” —Christian Robert, Université Paris-Dauphine (review)

“A pedagogical masterpiece…” —Rasmus Bååth, Lund University

“The content of this book has been developed over a decade+ of McElreath’s teaching and mentoring of graduate students, post docs, and other colleagues, and it really shows.” —Brian Wood, Yale

“…omg suddenly everything makes sense…” —Ecstatic anonymous reader

Overview

Warmup
Why statistics
do it yourself statistics
Linear Regression
Multiple Regression
Categorical Variables
Interactions and the famous 2x2 design
Inference
Asumptions
Bootstrapping
Four ways of statistics
Equivalence of traditional tests

Correlation

\[r = \frac{cov_{XY}}{\sigma_X\sigma_Y}\] \[ Y = \beta_0 + X\beta_1\] \[ r = \beta_1\frac{\sigma_X}{\sigma_Y}\]

Alternative: standardize both Y & X (e.g. $Y^* = \frac{Y}{\sigma_Y}$) before fitting the regression, you will receive correlations

t-test

Test for two groups: $t = \frac{\hat \mu_1 - \hat \mu_2}{\sqrt{SE_1^2 + SE_2^2}}$

t.test(data$DriversKilled[data$law==0],data$DriversKilled[data$law==1],var.equal = T)

## 
##  Two Sample t-test
## 
## data:  data$DriversKilled[data$law == 0] and data$DriversKilled[data$law == 1]
## t = 4.7942, df = 190, p-value = 3.288e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15.07239 36.14552
## sample estimates:
## mean of x mean of y 
##  125.8698  100.2609

print(summary(lm(data=data,DriversKilled~law)),concise=T)

## 
## Call: lm(formula = DriversKilled ~ law, data = data)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  125.870      1.849  68.082   <1e-04 ***
## law          -25.609      5.342  -4.794   <1e-04 ***
## 
## Residual standard error: 24.03 on 190 degrees of freedom
## Multiple R-squared:  0.1079, Adjusted R-squared:  0.1032 
## F-statistic: 22.98 on 1 and 190 DF,  p-value: < 1e-04

ANOVA

d = data.frame(datasets::mtcars)
d$gear = factor(d$gear)
summary(aov(hp~gear,data=d))

##             Df Sum Sq Mean Sq F value  Pr(>F)    
## gear         2  64213   32106   11.42 0.00022 ***
## Residuals   29  81514    2811                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

l1 = lm(hp~1+gear,data=d)
l0 = lm(hp~1,data=d)
anova(l1,l0)

## Analysis of Variance Table
## 
## Model 1: hp ~ 1 + gear
## Model 2: hp ~ 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     29  81514                                  
## 2     31 145727 -2    -64213 11.422 0.0002196 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You could also use the anovaon the linea model output to get the anova table

ANCOVA

summary(aov(price~factor(bedrooms)+reviews,dair[1:800,]))

##                   Df   Sum Sq Mean Sq F value Pr(>F)    
## factor(bedrooms)   8  7797765  974721  26.238 <2e-16 ***
## reviews            1    28260   28260   0.761  0.383    
## Residuals        790 29347410   37149                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

l1 = lm(price~1+factor(bedrooms)+reviews,data=dair[1:800,])
l0 = lm(price~1+factor(bedrooms),data=dair[1:800,])
anova(l0,l1)

## Analysis of Variance Table
## 
## Model 1: price ~ 1 + factor(bedrooms)
## Model 2: price ~ 1 + factor(bedrooms) + reviews
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1    791 29375670                           
## 2    790 29347410  1     28260 0.7607 0.3834

recommendations

I like these books: - Statistical Rethinking - (more math) Advanced Data Analysis from an elemental point view (Cosma Shalizi)

I like these webpages: https://www.flutterbys.com.au/stats/

And thats it!

statistic cat

Applied Generalized Linear (Mixed) Modelling

Benedikt Ehinger

March 19th, 2018

Slides & my Presentation style

Organization

Exam

Athena: https://athena.cogsci.uos.de/

Course Overview

Course Overview

Overview

Purpose of statistics

Why statistics in the first place?

Most important lesson

Overview

Three extremely important concepts

Population

Sample

Sampling Distribution

Overview

What are statistical inferences?

Ingredients for every statistical test

Example \(n_{obs} = 10\)

What we need for our example:

The \(H_0\) model

The p-value

Analytic shortcut

Summary

Overview

Should the seatbelt law be withdrawn?

There could be confounds!

Preview: Multiple regression to the rescue

Overview

The simple linear regression

Lines are simple! Intercept + Slope

Which line to choose?

Residuals / L2 Norm

In our example

in R

Explained Variance

Single outlier

Overview

With two predictors you span a plane

Overlapping variance

Perfect redundant

Partially redundant

Surpressors: Including unrelated variables can improve modelfit

a graphical intuition

A shorthand notation

Designmatrix & algebra

Overview

Dummy Coding

Dummy Coding

Interpretation

Effect Coding aka sum coding aka deviation coding

Multiple levels

Overview

Interaction effects

How are interactions coded

Interactions of continuous/categorical predicotrs

The difference between main and simple effects

Effect coding to the help!

Centering continuous predictors

Interaction between two continuous variables

Interaction as a plane

Overview

Two general ways

sampling distribution of predictors (walds t-test)

Model comparison / F-test

What is a standard error?

Confidence Interval as an area of the sampling distribution

Confidence intervals visualized

Important notes on confidence intervals

Just to be sure: Mr. Mean

The \(H_0\) for the simple regression

Expert slide: Estimating standard errors from X

Walds t-test

What is a t-distribution?

Overview

Variance Explained / F-Test

Empirical distribution: