R: iterative outliers detection

妖精的绣舞 submitted on 2019-12-25 02:58:44

Question


I have a data frame (df) as follows:

V V1 V2 V3
1  A  B  32
1  A  C  33
1  A  E  43 
1  A  F  22
1  A  T  53 
1  A  N  54
1  C  T  44 
1  C  G  11
1  C  N  31
1  C  D  53
1  C  U  75
1  A  T  53 
1  A  N  54
2  C  T  42 
2  C  G  14
2  C  N  35
2  C  D  23
2  C  U  56

What I want to do is get the outliers for each combination of (V, V1), and this is easy to achieve with the code I have.

library(data.table)
library(outliers)

d <- as.data.table(df)

# Add a column to keep track of row numbers
d[, row := .I]

# For each group (combination of V and V1), perform the outlier test
outliers <- d[, chisq.out.test(V3), by = list(V, V1)]

The main problem is that this function returns just one outlier, with its p-value, for each combination of (V, V1). What I need is all the outliers for each (V, V1) along with their p-values; in other words, all the candidates from V2 along with their p-value of being an outlier within (V, V1).

Any ideas how I can change my code to do that?


Answer 1:


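I think this may work. The dropout function below tests for outliers iteratively: it runs chisq.out.test, and while the result is significant (p < .05) it records that p-value for the most extreme remaining value, drops that value, and tests again. For each element you pass in, it returns 1 if the element is not flagged as an outlier; otherwise it returns the p-value (< .05) of the test that flagged it.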

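library(outliers)
dropout <- function(x) {
    # Too few values to test: mark everything as "not an outlier"
    if (length(x) < 2) return(rep.int(1, length(x)))
    # vals[i] == 1 means element i has not been flagged (yet)
    vals <- rep.int(1, length(x))
    r <- chisq.out.test(x)
    # isTRUE() guards against an NA/NaN p-value (e.g. zero variance);
    # && is the scalar operator that belongs in a loop condition
    while (isTRUE(r$p.value < .05) && sum(vals == 1) > 2) {
        # Find the value the test just flagged among the remaining ones
        if (grepl("lowest", r$alternative)) {
            d <- which.min(ifelse(vals == 1, x, NA))
        } else {
            d <- which.max(ifelse(vals == 1, x, NA))
        }
        # Record its p-value, then re-test without it
        vals[d] <- r$p.value
        r <- chisq.out.test(x[vals == 1])
    }
    vals
}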

With that helper function in place, we can now apply it to each of the sub-groups defined by V, V1. To do that, we use the ave function.

with(df, ave(V3, V, V1, FUN = dropout))
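
If the goal is a p-value for every candidate rather than only for the values flagged one at a time, the same ave pattern can be used with a per-element statistic. The following is only a minimal sketch in base R. It assumes the chi-squared outlier statistic is (x - mean(x))^2 / var(x) compared against a chi-squared distribution with 1 degree of freedom, which is my understanding of what chisq.out.test computes for the most extreme value. The helper name chisq_p_all is hypothetical, and the data frame is a subset of the rows from the question.

```r
# Hypothetical helper: chi-squared outlier p-value for every element of x
# (assumed statistic: squared deviation from the mean over the sample variance)
chisq_p_all <- function(x) {
    # With fewer than 2 values, or zero variance, the statistic is undefined
    if (length(x) < 2 || var(x) == 0) return(rep(NA_real_, length(x)))
    stat <- (x - mean(x))^2 / var(x)
    pchisq(stat, df = 1, lower.tail = FALSE)
}

# A subset of the question's rows, rebuilt here so the sketch is self-contained
df <- data.frame(
    V  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
    V1 = c("A", "A", "A", "A", "A", "C", "C", "C", "C", "C"),
    V2 = c("B", "C", "E", "F", "T", "T", "G", "N", "D", "U"),
    V3 = c(32, 33, 43, 22, 53, 42, 14, 35, 23, 56)
)

# One p-value per row, computed within each (V, V1) group
df$p <- with(df, ave(V3, V, V1, FUN = chisq_p_all))

# Smallest p-values first within each group
df[order(df$V, df$V1, df$p), ]
```

Unlike dropout, this does not remove values and re-test, so every p-value in a group is computed against the same mean and variance.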

It appears your sample data has no outliers in any of the sub-groups, given chisq.out.test's definition of an outlier.

Note that this iterative process is not statistically meaningful, given the problems with testing for outliers in general and the multiple testing problem in particular. That discussion belongs on https://stats.stackexchange.com/, though; here we just focus on the code.
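
As a purely code-level illustration of the multiple testing point, base R's p.adjust can apply a standard correction to a set of p-values. This is a sketch only: the p-values below are made up, the choice of the Holm method is an assumption, and applying it does not by itself make the iterative procedure statistically valid.

```r
# Made-up p-values standing in for several outlier tests within one group
p_raw <- c(0.001, 0.020, 0.040, 0.300)

# Holm correction for multiple comparisons (p.adjust is part of base R's stats)
p_holm <- p.adjust(p_raw, method = "holm")

# Which values would still be flagged at the .05 level after correction
flagged <- p_holm < 0.05
data.frame(p_raw, p_holm, flagged)
```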



Source: https://stackoverflow.com/questions/23784843/r-iterative-outliers-detection
