{"id":183,"date":"2017-01-13T17:10:29","date_gmt":"2017-01-13T15:10:29","guid":{"rendered":"http:\/\/benediktehinger.de\/blog\/science\/?p=183"},"modified":"2017-01-13T19:53:02","modified_gmt":"2017-01-13T17:53:02","slug":"scatterplots-regression-lines-and-the-first-principal-component","status":"publish","type":"post","link":"https:\/\/benediktehinger.de\/blog\/science\/scatterplots-regression-lines-and-the-first-principal-component\/","title":{"rendered":"Scatterplots, regression lines and the first principal component"},"content":{"rendered":"<p>I made some graphs that show the relation between X1~X2 (X2 predicts X1), X2~X1 (X1 predicts X2) and the first principal component (direction with highest variance, also called total least squares).<\/p>\n<p>The line you fit with a principal component is not the same line as in a regression (either predicting X2 by X1 [X2~X1] or X1 by X2 [X1~X2]. This is quite well known (see references below).<br \/>\n<a href=\"http:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2017\/01\/Rplot01.png\"><img loading=\"lazy\" decoding=\"async\" class=\"style-svg\" src=\"http:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2017\/01\/Rplot01.png\" alt=\"\" width=\"779\" height=\"491\" \/><\/a><\/p>\n<p>With regression one predicts X2 based on X1 (X2~X1 in R-Formula writing) or vice versa. With principal component (or total least squares) one tries to quantify the relation between the two. To completely understand the difference, image what quantity is reduced in the three cases.<\/p>\n<p>In regression, we reduce the residuals in direction of the dependent variable. With principal components, we find the line, that has the smallest error orthogonal to the regression line. 
See the following image for a visual illustration.<br \/>\n<a href=\"http:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2017\/01\/Rplot.png\"><img loading=\"lazy\" decoding=\"async\" class=\"style-svg\" src=\"http:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2017\/01\/Rplot.png\" alt=\"\" width=\"945\" height=\"300\" \/><\/a><\/p>\n<p>For me it becomes interesting when you plot a scatter plot of two variables where neither one is naturally the dependent one, i.e. where you would usually report the correlation coefficient. The &#8216;correct&#8217; line accompanying the correlation coefficient would be the principal component (&#8216;correct&#8217; in the sense that it is also agnostic to the order of the signals).<\/p>\n<h4>Further information:<\/h4>\n<p><a href=\"https:\/\/elifesciences.org\/content\/2\/e00638\">How to draw the line, eLife 2013<\/a><br \/>\nGelman, Hill &#8211; Data Analysis using Regression, p. 58<br \/>\n<a href=\"https:\/\/martinsbioblogg.wordpress.com\/2013\/05\/31\/how-to-draw-the-line-with-ggplot2\/\"> Also check out the nice blogpost from Martin Johnsson doing practically the same thing but three years earlier \ud83d\ude09 <\/a><\/p>\n<h4>Source<\/h4>\n<p>```R<br \/>\nlibrary(ggplot2)<br \/>\nlibrary(magrittr)<br \/>\nlibrary(plyr)<\/p>\n<p>set.seed(2)<br \/>\ncorrCoef = 0.5 # correlation; sample 10 datapoints from a multivariate normal<br \/>\ndat = MASS::mvrnorm(10,c(0,0),Sigma = matrix(c(1,corrCoef,corrCoef,1),2,2)) # Sigma must be symmetric<br \/>\ndat[,1] = dat[,1] - mean(dat[,1]) # centering makes life easier for the princomp<br \/>\ndat[,2] = dat[,2] - mean(dat[,2])<\/p>\n<p>dat = data.frame(x1 = dat[,1],x2 = dat[,2])<\/p>\n<p># Calculate the first principal component<br \/>\n# see http:\/\/stats.stackexchange.com\/questions\/13152\/how-to-perform-orthogonal-regression-total-least-squares-via-pca<br \/>\nv = dat %>% prcomp %$% rotation<br \/>\nx1x2cor = bCor = v[2,1]\/v[1,1]<\/p>\n<p>x1tox2 = coef(lm(x1~x2,dat))<br \/>\nx2tox1 = coef(lm(x2~x1,dat))<br \/>\nslopeData = data.frame(slope = 
c(x1x2cor,1\/x1tox2[2],x2tox1[2]),type=c('Principal Component','X1~X2','X2~X1'))<\/p>\n<p># We want this to draw the neat orthogonal lines.<br \/>\npointOnLine = function(inp){<br \/>\n  # y = a*x + c (c=0)<br \/>\n  # yOrth = -(1\/a)*x + d<br \/>\n  # yOrth = b*x + d<br \/>\n    x0 = inp[1]<br \/>\n    y0 = inp[2]<br \/>\n    a = x1x2cor<br \/>\n    b = -(1\/a)<br \/>\n    c = 0<br \/>\n    d = y0 - b*x0<br \/>\n    x = (d-c)\/(a-b)<br \/>\n    y = -(1\/a)*x+d<br \/>\n    return(c(x,y))<br \/>\n}<br \/>\npoints = apply(dat,1,FUN=pointOnLine)<\/p>\n<p>segmeData = rbind(data.frame(x=dat[,1],y=dat[,2],xend=points[1,],yend=points[2,],type = 'Principal Component'),<br \/>\n                  data.frame(x=dat[,1],y=dat[,2],yend=dat[,1]*x2tox1[2],xend=dat[,1],type='X2~X1'),<br \/>\n                  data.frame(x=dat[,1],y=dat[,2],yend=dat[,2],xend=dat[,2]*x1tox2[2],type='X1~X2'))<\/p>\n<p>ggplot(aes(x1,x2),data=dat)+geom_point()+<br \/>\n  geom_abline(data=slopeData,aes(slope = slope,intercept=0,color=type))+<br \/>\n  theme_minimal(20)+coord_equal()<\/p>\n<p>ggplot(aes(x1,x2),data=dat)+geom_point()+<br \/>\n  geom_abline(data=slopeData,aes(slope = slope,intercept=0,color=type))+<br \/>\n  geom_segment(data=segmeData,aes(x=x,y=y,xend=xend,yend=yend,color=type))+facet_grid(.~type)+coord_equal()+theme_minimal(20)<br \/>\n```<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I made some graphs that show the relation between X1~X2 (X2 predicts X1), X2~X1 (X1 predicts X2) and the first principal component (direction with highest variance, also called total least squares). The line you fit with a principal component is not the same line as in a regression (either predicting X2 by X1 [X2~X1] or X1 by X2 [X1~X2]). This is quite well known (see references below). With regression one predicts X2 based on X1 (X2~X1 in R-Formula writing) or vice versa. 
With principal component (or total least squares) one tries to quantify the relation between the two. To completely&#8230;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-183","post","type-post","status-publish","format-standard","hentry","category-blog"],"_links":{"self":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/posts\/183","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/comments?post=183"}],"version-history":[{"count":0,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/posts\/183\/revisions"}],"wp:attachment":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/media?parent=183"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/categories?post=183"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/tags?post=183"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}