Blog

Simple Filter Generation

I sometimes explain concepts to my students. Then I forget the code and the next time round, I have to redo it. Take this post as an extended memory-post. In this case I showed what filter-ringing artefacts could look like. This is especially important for EEG preprocessing where filtering is a standard procedure.

A good introduction to filtering can be found in this slides by andreas widmann or this paper by andreas widmann

Impulse with noise

I simulated as simple impulse response with some additional noise. The idea is to show the student that big spikes in the EEG-data could result in filter-ringing that is quite substantial and far away from the spike. The filter

This is the filter I generated. It is a BAD filter. It shows huge passband ripples. But for educational purposes it suits me nice. I usually explain what passbands, transitionband, stopband, ripples and attenuation means. The code

“matlab
fs = 1000;
T = 2;
time = 0:1/fs:T-1/fs;

data = zeros(length(time),1);
% data(end/2:end) = data(end/2:end) + 1;
data(end/2) = data(end/2) + 1;
data = data + rand(size(data))*0.02;

subplot(2,1,1)
plot(data)

filtLow = designfilt(‘lowpassfir’,’PassbandFrequency’,100, …
‘StopbandFrequency’,110,’PassbandRipple’,0.5, …
‘StopbandAttenuation’,65,’SampleRate’,1000);

subplot(2,1,2)

% 0-padding to get the borders right

% Filter twice to get the same phase(non-causal)
a = filter(filtLow,data);
b = filter(filtLow,a(end:-1:1));
b = b(round(filtLow.filtord)+1:end – round(filtLow.filtord));
plot(b)

fvtool(filtLow) % to look at the filter

“

Logistic Regression: Will it rain in Osnabrück tomorrow?

TLDR;

52% of days it rained (Precipitation>0)

Is it sunny? 2x as likely that it is sunny tomorrow as well.

Is it rainy? 2.3x as likely that it is rainy tomorrow as well

Predicting rainfall using logistic regression

I’m playing around with some analyses for an upcoming course. This follows loosely the example of “Advanced Data Analysis from an Elemental Point of View” Chapter 12.7
It is a somewhat naive analysis. Further improvements are discussed at the end.

library(ggplot2)
library(plyr)

We load the data and change some of the German variables

weather$date = as.Date.character(x=weather$MESS_DATUM,format="%Y%m%d")
weather = rename(weather,replace = c("NIEDERSCHLAGSHOEHE"="rain"))
weather$rain_tomorrow = c(tail(weather$rain,-1),NA)

weather <- weather[-nrow(weather),] # remove NA row
summary(weather[,c("date","rain","rain_tomorrow")])
##       date                 rain         rain_tomorrow
##  Min.   :1989-10-01   Min.   :  0.000   Min.   :  0.000
##  1st Qu.:1996-04-23   1st Qu.:  0.000   1st Qu.:  0.000
##  Median :2002-11-15   Median :  0.100   Median :  0.100
##  Mean   :2002-11-15   Mean   :  2.055   Mean   :  2.055
##  3rd Qu.:2009-06-07   3rd Qu.:  2.200   3rd Qu.:  2.200
##  Max.   :2015-12-30   Max.   :140.100   Max.   :140.100

The data range from 1989 up until the end of 2015.

p1= ggplot(weather,aes(x=rain))+geom_density()
cowplot::plot_grid(p1,p1+scale_x_log10()) Notice that the plot on the left is in native scale, the one on the right in x-axis-log-scale.
We take the mean of rainy days (ml/day > 0): There is a 52% probability it will rain on a given day (what everyone living in Osnabrück already knew, it rains more often than not).

Predicting rain from the day before

For exercise reasons I made several logistic regressions. I try to predict whether there will be rain on the day afterwards, based on the amount of rain on the current day.

m.weather.1 = glm(formula = rain_tomorrow>0 ~ rain>0,data=weather,family = "binomial")
summary(m.weather.1)
##
## Call:
## glm(formula = rain_tomorrow > 0 ~ rain > 0, family = "binomial",
##     data = weather)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.5431  -0.8889   0.8514   0.8514   1.4965
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.72478    0.03136  -23.11   <2e-16 ***
## rain > 0TRUE  1.55287    0.04400   35.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 13278  on 9586  degrees of freedom
## Residual deviance: 11938  on 9585  degrees of freedom
## AIC: 11942
##
## Number of Fisher Scoring iterations: 4

Whoop Whoop, Wald’s t-value of ~35! Keep in mind that the predictor values are in logit-space. That means, a predictor value of -0.72 is a log-odd value. In order to calculate this back to an easier interpreted format, we can simply take the exponential and receive the odds-ratio.

Do the calculations:

exp(coef(m.weather.1))
## (Intercept)
##   0.4844302
exp(sum(coef(m.weather.1))) # the odds for rain tomorrow if it is rainy (2 : 1)
##  2.288933

Now we can interprete the odds:

The odds of rain tomorrow if there was a sunny day are 0.5:1, i.e. it is double as likely that the next day is sunny, if it was sunny today
The odds of rain tomorrow if it was a rainy day are 2.3:1, i.e. it is more than double as likely that the next day is rainy, if it was rainy today.

Can we get better?

We could try to predict rain based on the daily continuous precipitation (in ml).
We will compare this with the previous model using BIC (smaller number => better model).

m.weather.2 = glm(formula = rain_tomorrow>0 ~ rain,data=weather,family="binomial")
BIC(m.weather.1,m.weather.2)
##             df      BIC
## m.weather.1  2 11956.09
## m.weather.2  2 12505.85

Turns out: No, a categorical model (is there rain today, or not?) beats the continuous one. But why?

d.predNew = data.frame(rain=seq(0,50));
d.pred=             cbind(d.predNew,pred=arm::invlogit(predict(m.weather.2,newdata = d.predNew)),model='2')
d.pred=rbind(d.pred,cbind(d.predNew,pred=arm::invlogit(predict(m.weather.1,newdata = d.predNew)),model='1'))

p1 = ggplot()+geom_point(data=d.pred,aes(x=rain,y=pred,color=model) )+
geom_smooth(data=weather,aes(x=rain,y=(rain_tomorrow>0)*1))
cowplot::plot_grid(p1,p1+scale_x_log10()) We plot the predictions model 1 does in green, model 2 in red and the smoothed data in blue. On the left we have the x-axis in native units [ml?] on the right on log-scale. It is very clear, that the red dots do not match the empirical data (blue) at all. The green dots (model 1) are better. My guess is, this is due to some outlier influencing the slope, but also due to the non-homogenious distribution of rain-values as seen i the first figure
We will try two transformations off X (a log effect seems possible?)

• sqrt(rain)
• log(rain+0.001)
m.weather.3 = glm(formula = rain_tomorrow>0 ~ sqrt(rain),data=weather,family="binomial")
# Box-Cox transform with lambda2 = 0.001 http://robjhyndman.com/hyndsight/transformations/
m.weather.4 = glm(formula = rain_tomorrow>0 ~ log(rain+0.001),data=weather,family="binomial")

BIC(m.weather.1,m.weather.2,m.weather.3,m.weather.4)
##             df      BIC
## m.weather.1  2 11956.09
## m.weather.2  2 12505.85
## m.weather.3  2 12020.09
## m.weather.4  2 11795.15

It is surprisingly hard to beat the simple first model, but in the end – we did it! The log(rain) model seems to capture the data best.

d.predNew = data.frame(rain=seq(0,50));
d.pred=             cbind(d.predNew,pred=arm::invlogit(predict(m.weather.4,newdata = d.predNew)),model='2')
d.pred=rbind(d.pred,cbind(d.predNew,pred=arm::invlogit(predict(m.weather.1,newdata = d.predNew)),model='1'))
d.pred=rbind(d.pred,cbind(d.predNew,pred=arm::invlogit(predict(m.weather.3,newdata = d.predNew)),model='3'))
d.pred=rbind(d.pred,cbind(d.predNew,pred=arm::invlogit(predict(m.weather.4,newdata = d.predNew)),model='4'))

p1 = ggplot()+geom_point(data=d.pred,aes(x=rain,y=pred,color=model) )+
geom_smooth(data=weather,aes(x=rain,y=(rain_tomorrow>0)*1))
cowplot::plot_grid(p1,p1+scale_x_log10()) Visually we can see that model 4 comes the non-linear smoother the closest.

Improvements

• use nonlinear effects (GAM) as done in the book
• multivariate (make use of the plentitude of other information as well)
• use the months as a cyclical predictor, i.e. seasonal trends will be clearly present in the data

Misc

This post was directly exported using knit2wp and R-Markdown. It works kind of okay, but clearly the figures are not wide enough, even though I specify the correct width in the markdown. I might upload them later manually.

How to use bimodal priors for bayesian data analysis in STAN

I tried adding a bi-modal prior in STAN for a homework exercise on Bayesian Data Analysis. At first, I thought this could work:
“STAN
model{
mu ~ normal(-0.5,0.3) + normal(0.5,0.3);
}
“
But it doesn’t. I had to dig deeper and I thought I could simply add multiple times to the log-posterior due to a sideremark of Bob Carpenter:

“STAN
target += normal_lpdf(mu|.5,0.3);
target += normal_lpdf(mu|-.5,0.3);
“
Which also does not work. Finally, the solution is akin to the mixture model in the STAN manual:

“STAN
target += log_sum_exp(normal_lpdf(mu|.5,0.3),normal_lpdf(mu|-.5,0.3));
“

This results in beautiful bi-modal priors: I did not find anything on google or the manual of how to do this. If there is a smarter way to do it, please leave a comment.

“R

library(rstan)

model <- " data { } parameters { real mu; } transformed parameters { } model { //mu ~ normal(10,1); //mu ~ normal(-10,1); target += log_sum_exp(normal_lpdf(mu|-.5,.3),normal_lpdf(mu|.5,.3)); }" samples <- stan(model_code=model, iter=2000, chains=4, thin=1, # seed=123 # Setting seed; Default is random seed ) ggmcmc::ggs_density(ggmcmc::ggs(samples))+theme_minimal() 

Scatterplots, regression lines and the first principal component

I made some graphs that show the relation between X1~X2 (X2 predicts X1), X2~X1 (X1 predicts X2) and the first principal component (direction with highest variance, also called total least squares).

The line you fit with a principal component is not the same line as in a regression (either predicting X2 by X1 [X2~X1] or X1 by X2 [X1~X2]. This is quite well known (see references below). With regression one predicts X2 based on X1 (X2~X1 in R-Formula writing) or vice versa. With principal component (or total least squares) one tries to quantify the relation between the two. To completely understand the difference, image what quantity is reduced in the three cases.

In regression, we reduce the residuals in direction of the dependent variable. With principal components, we find the line, that has the smallest error orthogonal to the regression line. See the following image for a visual illustration. For me it becomes interesting if you plot a scatter plot of two independent variables, i.e. you would usually report the correlation coefficient. The ‘correct’ line accompaniyng the correlation coefficient would be the principal component (‘correct’ as it is also agnostic to the order of the signals).

Further information:

How to draw the line, eLife 2013
Gelman, Hill – Data Analysis using Regression, p.58
Also check out the nice blogpost from Martin Johnsson doing practically the same thing but three years earlier 😉

Source

“R
library(ggplot2)
library(magrittr)
library(plyr)

set.seed(2)
corrCoef = 0.5 # sample from a multivariate normal, 10 datapoints
dat = MASS::mvrnorm(10,c(0,0),Sigma = matrix(c(1,corrCoef,2,corrCoef),2,2))
dat[,1] = dat[,1] – mean(dat[,1]) # it makes life easier for the princomp
dat[,2] = dat[,2] – mean(dat[,2])

dat = data.frame(x1 = dat[,1],x2 = dat[,2])

# Calculate the first principle component
# see http://stats.stackexchange.com/questions/13152/how-to-perform-orthogonal-regression-total-least-squares-via-pca

Checks

• I can repeat Figure 1 from Phipson & Smyth 2010
• I get a flat histogram for two null effects
• I get 5% false positives at alpha=0.05

Future Implementation

• Add custom – function ability to select your own test-statistic (now its simply mean between groups). e.g. t-stat