{"id":69,"date":"2016-03-16T18:05:47","date_gmt":"2016-03-16T16:05:47","guid":{"rendered":"http:\/\/benediktehinger.de\/blog\/science\/?p=69"},"modified":"2016-04-27T13:40:41","modified_gmt":"2016-04-27T11:40:41","slug":"matlab-winsorized-mean-over-all-dimension","status":"publish","type":"post","link":"https:\/\/benediktehinger.de\/blog\/science\/matlab-winsorized-mean-over-all-dimension\/","title":{"rendered":"Matlab winsorized mean over all dimension"},"content":{"rendered":"<p>This is a function I wrote back in 2014. I think it illustrates an advanced functionality in matlab that I hadn&#8217;t found written about before.<\/p>\n<h3>The problem:<\/h3>\n<p>Calculate the winsorized mean of a multidimensional matrix over an arbitrary dimension.<\/p>\n<h3>Winsorized Mean<\/h3>\n<p>The benefits of the winsorized mean can be seen here:<br \/>\n<a href=\"http:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2016\/03\/Unbenannt-3.png\" rel=\"attachment wp-att-71\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2016\/03\/Unbenannt-3-1024x258.png\" alt=\"Unbenannt-3\" width=\"1024\" height=\"258\" class=\"alignnone size-large wp-image-71\" srcset=\"https:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2016\/03\/Unbenannt-3-1024x258.png 1024w, https:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2016\/03\/Unbenannt-3-300x76.png 300w, https:\/\/benediktehinger.de\/blog\/science\/upload\/sites\/2\/2016\/03\/Unbenannt-3-768x194.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n<p>We replace the top 10% and bottom 10% by the remaining most extreme value before calculating the mean (left panel). 
The right panel shows how the mean is influenced by a single outlier, while the winsorized mean is not (ignore the &#8220;yuen&#8221; box).<\/p>\n<h3>Current Implementation<\/h3>\n<p>I adapted an implementation from the <a href=\"https:\/\/github.com\/LIMO-EEG-Toolbox\/limo_eeg\/blob\/master\/limo_winvar.m\">LIMO toolbox<\/a>, based on original code from Prof. Patrick J. Bennett, McMaster University. In this code the dimension is fixed at dim = 3, the third dimension.<\/p>\n<p>They solve it in three steps:<\/p>\n<ol>\n<li>sort the matrix along dimension 3<\/li>\n[matlab] xsort=sort(x,3); [\/matlab]\n<li>replace the upper and lower 10% by the remaining most extreme value<\/li>\n[matlab]\n% number of items to winsorize and trim<br \/>\ng=floor((percent\/100)*n);<br \/>\nwx(:,:,1:g+1)=repmat(xsort(:,:,g+1),[1 1 g+1]);<br \/>\nwx(:,:,n-g:end)=repmat(xsort(:,:,n-g),[1 1 g+1]);<br \/>\n[\/matlab]\n<li>calculate the statistic of interest over the winsorized matrix (here the variance)<\/li>\n[matlab]wvarx=var(wx,0,3);[\/matlab]\n<\/ol>\n<h3>Generalisation<\/h3>\n<p>To generalize this to any dimension, I have seen two previous solutions that feel unsatisfying:<br \/>\n &#8211; hardcode the implementation for up to X dimensions and then use a switch-case to select the right one.<br \/>\n &#8211; use permute to move the target dimension to the front and operate on the first dimension (which can be slow depending on the array).<\/p>\n<p>Let&#8217;s solve it for X = 20 x 10 x 5 x 2 over the third dimension:<br \/>\n[matlab]\n<p>function [x] = winMean(x,dim,percent)<br \/>\n% x = matrix of arbitrary dimension<br \/>\n% dim = dimension to calculate the winsorized mean over<br \/>\n% percent = how strongly to winsorize (default 20)<\/p>\n<p>% how long is the matrix in the required dimension<br \/>\nn=size(x,dim);<br \/>\n% number of items to winsorize and trim<br \/>\ng=floor((percent\/100)*n);<br \/>\nx=sort(x,dim);<\/p>\n[\/matlab]\nUp to here, my version and the original are very similar. 
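For the example size, the numbers work out as follows (a quick sketch, not part of the original function):<br \/>\n[matlab]\n% X is 20 x 10 x 5 x 2, dim = 3, percent = 20:<br \/>\nn = size(x,dim);            % n = 5<br \/>\ng = floor((percent\/100)*n); % g = floor(0.2*5) = 1 entry winsorized per tail<br \/>\n[\/matlab]\n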
The hardest part is to generalize the step where the entries are overwritten, without resorting to a loop.<br \/>\nFor this we use the <a href=\"http:\/\/de.mathworks.com\/help\/matlab\/ref\/subsasgn.html\">subsasgn<\/a> and <a href=\"http:\/\/de.mathworks.com\/help\/matlab\/ref\/subsref.html\">subsref<\/a> commands.<br \/>\nWe need to generate a structure that mimics the syntax of<br \/>\n[matlab] x(:,:,1:g+1,:) = y [\/matlab]\nfor arbitrary dimensions, and we need to construct <em>y<\/em>.<\/p>\n[matlab]\n% prepare structs<br \/>\nSrep.type = &#8216;()&#8217;;<br \/>\nS.type = &#8216;()&#8217;;<\/p>\n<p>% winsorize the lower end<br \/>\nnDim = length(size(x));<\/p>\n<p>beforeColons = num2cell(repmat(&#8216;:&#8217;,dim-1,1));<br \/>\nafterColons  = num2cell(repmat(&#8216;:&#8217;,nDim-dim,1));<br \/>\nSrep.subs = {beforeColons{:} [g+1]    afterColons{:}};<br \/>\nS.subs    = {beforeColons{:} [1:g+1]  afterColons{:}};<br \/>\nx = subsasgn(x,S,repmat(subsref(x,Srep),[ones(1,dim-1) g+1 ones(1,nDim-dim)])); % general case<br \/>\n[\/matlab]\nFor our example (g = 1), <em>Srep<\/em> is:<\/p>\n<blockquote><p>\n  Srep =<br \/>\n    type: &#8216;()&#8217;<br \/>\n    subs: {&#8216;:&#8217;   &#8216;:&#8217;  [2]  &#8216;:&#8217; }\n<\/p><\/blockquote>\n<p>Thus subsref(x,Srep) returns what x(:,:,2,:) would return. 
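This equivalence can be sanity-checked directly (a sketch, assuming the example size and g = 1):<br \/>\n[matlab]\nx = rand(20,10,5,2);<br \/>\nSrep.type = &#8216;()&#8217;;<br \/>\nSrep.subs = {&#8216;:&#8217; &#8216;:&#8217; 2 &#8216;:&#8217;};<br \/>\nisequal(subsref(x,Srep), x(:,:,2,:)) % logical 1 (true)<br \/>\n[\/matlab]\n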
We then need to repmat it to match the number of elements replaced by the winsorizing.<\/p>\n<p>This is put into subsasgn, where <em>S<\/em> is:<\/p>\n<blockquote><p>\n  S =<br \/>\n    type: &#8216;()&#8217;<br \/>\n    subs: {&#8216;:&#8217;   &#8216;:&#8217;  [1 2]  &#8216;:&#8217; }\n<\/p><\/blockquote>\n<p>Thus it is equivalent to x(:,:,[1 2],:).<br \/>\nThe evaluated expression then is:<br \/>\n[matlab] x(:,:,1:2,:) = repmat(x(:,:,2,:),[1 1 2 1]) [\/matlab]\n<p>The upper percentile is replaced analogously:<br \/>\n[matlab]\n% winsorize the upper end<br \/>\nSrep.subs = {beforeColons{:} [n-g]            afterColons{:}};<br \/>\nS.subs    = {beforeColons{:} [n-g:size(x,dim)]  afterColons{:}};<\/p>\n<p>x = subsasgn(x,S,repmat(subsref(x,Srep),[ones(1,dim-1) g+1 ones(1,nDim-dim)])); % general case<\/p>\n[\/matlab]\n<p>In the end we can take the mean, var, nanmean or whatever statistic we need:<br \/>\n[matlab]\nx = squeeze(nanmean(x,dim));<br \/>\n[\/matlab]\n<p>That finishes the implementation.<\/p>\n<h3>Timing<\/h3>\n<p>But how about speed? To find out, I generated a random matrix of 200 x 10000 x 5 and measured the timing (n=100 runs) of the original limo implementation and mine:<\/p>\n<table>\n<thead>\n<tr>\n<th><strong>algorithm<\/strong><\/th>\n<th><strong>timing (95% bootstrapped CI of mean)<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>limo_winmean<\/td>\n<td>185 &#8211; 188 ms<\/td>\n<\/tr>\n<tr>\n<td>my_winmean<\/td>\n<td>202 &#8211; 203 ms<\/td>\n<\/tr>\n<tr>\n<td>limo_winmean, other dimension than 3&nbsp;&nbsp;&nbsp;<\/td>\n<td>218 &#8211; 228 ms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For the last comparison I permuted the array prior to calculating the winsorized mean, hence the overhead. In my experience, the overhead grows with array size (I&#8217;m talking about 5&#8211;10 GB matrices here).<\/p>\n<h2>Conclusion<\/h2>\n<p>My generalization seems to work fine. As expected, it is slower than the hardcoded version. 
But it is faster than permuting the whole array.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a function I wrote back in 2014. I think it illustrates an advanced functionality in matlab that I hadn&#8217;t found written about before. The problem: Calculate the winsorized mean of a multidimensional matrix over an arbitrary dimension. Winsorized Mean The benefits of the winsorized mean can be seen here: We replace the top 10% and bottom 10% by the remaining most extreme value before calculating the mean (left panel). The right panel shows how the mean is influenced by a single outlier, but the winsorized mean is not (ignore the &#8220;yuen&#8221;-box&#8221;) Current Implementation I adapted an implementation from&#8230;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-69","post","type-post","status-publish","format-standard","hentry","category-blog"],"_links":{"self":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/posts\/69","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/comments?post=69"}],"version-history":[{"count":0,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/posts\/69\/revisions"}],"wp:attachment":[{"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/media?parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/categories?post=69"},{"taxonomy":"post_tag"
,"embeddable":true,"href":"https:\/\/benediktehinger.de\/blog\/science\/wp-json\/wp\/v2\/tags?post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}