R: iterative outliers detection

Question

I have a data frame (df) as follows:

V V1 V2 V3
1  A  B  32
1  A  C  33
1  A  E  43 
1  A  F  22
1  A  T  53 
1  A  N  54
1  C  T  44 
1  C  G  11
1  C  N  31
1  C  D  53
1  C  U  75
1  A  T  53 
1  A  N  54
2  C  T  42 
2  C  G  14
2  C  N  35
2  C  D  23
2  C  U  56

What I want to do is get the outliers for each combination of (V, V1), and this is easy to achieve with the code I have.

library(data.table)
library(outliers)

d <- as.data.table(df)

# Add a column to keep track of row numbers
d[, row := .I]

# For each group (combination of V and V1), perform the outlier test
outliers <- d[, chisq.out.test(V3), by = list(V, V1)]

The main problem is that this function returns just one outlier, with its p-value, for each combination of (V, V1). What I need is all the outliers for each (V, V1) along with their p-values; in other words, every candidate from V2 together with its p-value of being an outlier within (V, V1).

Any ideas on how I can change my code to do that?

Answer 1:

I think this may work. The dropout function loops iteratively to test for outliers. For each element you pass in, it returns 1 if the element is not an outlier; otherwise it returns the p-value (< .05) from the test in which that element was flagged.

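library(outliers)

dropout <- function(x) {
    # A group with fewer than two values cannot be tested
    if (length(x) < 2) return(1)
    # 1 marks "not an outlier"; flagged elements get their p-value instead
    vals <- rep.int(1, length(x))
    r <- chisq.out.test(x)
    # Keep flagging the most extreme remaining value while the test is
    # significant and more than two unflagged values remain
    while (r$p.value < .05 && sum(vals == 1) > 2) {
        if (grepl("lowest", r$alternative)) {
            d <- which.min(ifelse(vals == 1, x, NA))
        } else {
            d <- which.max(ifelse(vals == 1, x, NA))
        }
        vals[d] <- r$p.value
        r <- chisq.out.test(x[vals == 1])
    }
    vals
}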

With that helper function in place, we can now apply it to each of the sub-groups defined by V, V1. To do that, we use the ave function.

with(df, ave(V3, V, V1, FUN = dropout))

It appears your sample data has no outliers in any of the sub-groups, given chisq.out.test's definition of an outlier.
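Since the sample data flags nothing, here is a minimal base-R sketch on made-up toy data that does contain one extreme value, just to show what the per-row output looks like. The names extreme_test and flag_outliers are hypothetical. extreme_test is a stand-in for chisq.out.test, assuming the statistic is the squared deviation of the most extreme value from the mean divided by the sample variance, compared against a chi-squared distribution with 1 degree of freedom. It is not the package's code.

```r
# Hypothetical stand-in for outliers::chisq.out.test (an assumption, see above):
# tests the value farthest from the mean against chi-squared with 1 df.
extreme_test <- function(x) {
  dev <- abs(x - mean(x))
  i <- which.max(dev)                      # position of the most extreme value
  stat <- dev[i]^2 / var(x)                # squared deviation over sample variance
  list(index = i, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

# Hypothetical counterpart of dropout(): 1 = not flagged,
# otherwise the p-value of the test in which the value was flagged.
flag_outliers <- function(x, alpha = 0.05) {
  vals <- rep(1, length(x))
  repeat {
    keep <- which(vals == 1)               # indices still considered inliers
    if (length(keep) < 3) break            # too few values left to test
    if (var(x[keep]) == 0) break           # identical values, statistic undefined
    r <- extreme_test(x[keep])
    if (r$p.value >= alpha) break          # nothing significant, stop
    vals[keep[r$index]] <- r$p.value       # record p-value at original position
  }
  vals
}

# Made-up data: group (1, A) contains one extreme value (60), group (1, C) does not.
toy <- data.frame(
  V  = rep(1, 14),
  V1 = c(rep("A", 8), rep("C", 6)),
  V3 = c(10, 11, 9, 10, 12, 11, 10, 60, 20, 22, 21, 23, 19, 21)
)

# One value per row, aligned with the original rows
toy$p_outlier <- with(toy, ave(V3, V, V1, FUN = flag_outliers))
print(toy)
```

Rows with p_outlier equal to 1 were not flagged; any other value is the p-value at the step where that row was removed. If you prefer to stay in data.table, something like d[, p_outlier := dropout(V3), by = .(V, V1)] should give the same per-row column with the dropout helper above.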

And surely this iterative process is not statistically meaningful, given the problems with testing for outliers in general and certainly given the multiple testing problem. Nevertheless, that discussion belongs on https://stats.stackexchange.com/; here we just focus on the code.

Source: https://stackoverflow.com/questions/23784843/r-iterative-outliers-detection

Tags

outliers