Removing univariate outliers from data frame (+-3 SDs)

Submitted by 拈花ヽ惹草 on 2019-11-30 22:53:25
> dat <- data.frame(
                    var1=sample(letters[1:2],10,replace=TRUE),
                    var2=c(1,2,3,1,2,3,102,3,1,2)
                   )
> dat
   var1 var2
1     b    1
2     a    2
3     a    3
4     a    1
5     b    2
6     b    3
7     a  102 #outlier
8     b    3
9     b    1
10    a    2

Now return only those rows whose var2 value is not (!) more than 2 standard deviations, in absolute terms, from the mean of the variable in question. Change 2 to however many SDs you want as the cutoff (for example 3, as in the title).

> dat[!(abs(dat$var2 - mean(dat$var2)) / sd(dat$var2) > 2), ]
   var1 var2
1     b    1
2     a    2
3     a    3
4     a    1
5     b    2
6     b    3 # no outlier
8     b    3 # between here
9     b    1
10    a    2
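
For reuse, the filter above can be wrapped in a small function. The following is a minimal sketch; the name remove_outliers_sd and its arguments are hypothetical and not part of the original answer. It also illustrates why a 3-SD cutoff needs a reasonably large sample: in a sample of n values no z-score can exceed (n - 1) / sqrt(n), which is about 2.846 for n = 10, so nothing in this toy data can ever be more than 3 SDs from the mean.

```r
# Hypothetical helper: keep rows whose value in `col` lies within
# k standard deviations of the column mean.
remove_outliers_sd <- function(df, col, k = 3) {
  x <- df[[col]]
  z <- abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  # Rows where z is NA (missing value or zero SD) are kept, not dropped.
  df[is.na(z) | z <= k, , drop = FALSE]
}

# Same var2 values as the example data; var1 is fixed here instead of sampled.
dat <- data.frame(
  var1 = c("b", "a", "a", "a", "b", "b", "a", "b", "b", "a"),
  var2 = c(1, 2, 3, 1, 2, 3, 102, 3, 1, 2)
)

nrow(remove_outliers_sd(dat, "var2", k = 2))  # 9: row 7 is dropped
nrow(remove_outliers_sd(dat, "var2", k = 3))  # 10: nothing is dropped
max(abs(scale(dat$var2)))                     # about 2.845, below the cutoff of 3
```

With only 10 observations, the single extreme value inflates both the mean and the SD enough that it stays inside 3 SDs, which is one reason to prefer a lower cutoff or a robust method on small samples.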

Or, more concisely, using the scale() function, which centers the values on their mean and divides by their SD:

dat[!(abs(scale(dat$var2)) > 2), ]

   var1 var2
1     b    1
2     a    2
3     a    3
4     a    1
5     b    2
6     b    3
8     b    3
9     b    1
10    a    2

Edit:

This can be extended to look for outliers within each group using by():

do.call(rbind,by(dat,dat$var1,function(x) x[!abs(scale(x$var2)) > 2,] ))

This assumes dat$var1 is your variable defining the group each row belongs to.
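
As an alternative sketch of the same within-group idea, ave() can compute each row's z-score relative to its own group while keeping the original row order and row names. The data below is made up for illustration (it is not the data from the answer) and is chosen so that one value is an outlier only within its group, not in the pooled data.

```r
# Made-up data: 12 is unusual within group "a" but not in the pooled column.
dat <- data.frame(
  var1 = rep(c("a", "b"), each = 8),
  var2 = c(1, 2, 3, 1, 2, 3, 2, 12,
           100, 101, 102, 100, 101, 102, 101, 100)
)

# Absolute z-score of each var2 value relative to its own var1 group
z_group <- ave(dat$var2, dat$var1,
               FUN = function(v) abs(v - mean(v)) / sd(v))

kept <- dat[z_group <= 2, ]

nrow(kept)                       # 15: the row with var2 == 12 is dropped
any(abs(scale(dat$var2)) > 2)    # FALSE: the pooled filter flags nothing
```

Unlike the do.call(rbind, by(...)) version, this does not reorder the rows by group or prefix the row names with the group label.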

I use the winsorize() function in the robustHD package for this task. Here is its example:

R> example(winsorize)

winsrzR> ## generate data
winsrzR> set.seed(1234)     # for reproducibility

winsrzR> x <- rnorm(10)     # standard normal

winsrzR> x[1] <- x[1] * 10  # introduce outlier

winsrzR> ## winsorize data
winsrzR> x
 [1] -12.070657   0.277429   1.084441  -2.345698   0.429125   0.506056  
 [7]  -0.574740  -0.546632  -0.564452  -0.890038

winsrzR> winsorize(x)
 [1] -3.250372  0.277429  1.084441 -2.345698  0.429125  0.506056 
 [7] -0.574740 -0.546632 -0.564452 -0.890038

winsrzR>

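This defaults to median +/- 2 MAD, but you can set the parameters to use mean +/- 3 SD instead. Note that winsorizing pulls outlying values in to the cutoff (as with the first element in the output above) rather than removing them.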
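
To make the clamping explicit without installing robustHD, here is a minimal base-R sketch of the same idea. The function winsorize_simple and its argument names are hypothetical stand-ins, not the package's implementation. The input vector is copied from the example output above.

```r
# Hypothetical base-R stand-in for winsorizing: values beyond
# center +/- const * spread are pulled back to that boundary, not removed.
winsorize_simple <- function(x, center_fun = median, scale_fun = mad, const = 2) {
  center <- center_fun(x)
  spread <- scale_fun(x)
  pmin(pmax(x, center - const * spread), center + const * spread)
}

# Values taken from the example(winsorize) output (rounded to 6 decimals)
x <- c(-12.070657, 0.277429, 1.084441, -2.345698, 0.429125,
       0.506056, -0.574740, -0.546632, -0.564452, -0.890038)

# Median +/- 2 MAD: only the first element is clamped (to about -3.2504)
w_robust <- winsorize_simple(x)

# Mean +/- 3 SD: leaves this sample unchanged
w_classic <- winsorize_simple(x, center_fun = mean, scale_fun = sd, const = 3)

# With robustHD installed, the mean +/- 3 SD variant is presumably
# winsorize(x, centerFun = mean, scaleFun = sd, const = 3); check ?winsorize.
```

On this 10-value sample the mean +/- 3 SD version changes nothing, because the outlier itself inflates the mean and SD; the median and MAD are far less affected by it, which is why the robust default catches the first element.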
