How to remove records from dataframe that fall outside variable-specific ranges? [R]

问题

I have a dataframe and a predictive model that I want to apply to the data. However, I want to filter out records for which the model might not apply very well. To do this, I have another dataframe that contains for every variable the minimum and maximum observed in the training data. I want to remove those records from my new data for which one or more values fall outside the specified range.

To make my question clear, this is what my data might look like:

  id   x       y     
 ---- ---- --------- 
   1    2     30521  
   2   -1      1835  
   3    5     25939  
   4    4   1000000

This is what my second table, with the mins and maxes, could look like:

  var   min    max   
 ----- ----- ------- 
  x       1       5  
  y       0   99999

In this example, I would want to flag the following records in my data: 2 (lower than the minimum for x) and 4 (higher than the max for y).

How could I easily do this in R? I have a hunch there's some clever dplyr code that would accomplish this task, but I wouldn't know what it would look like.

回答1:

You have your data like this:

df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))

You can use the sweep function with operator '>' and '<' it's quite straightforward!

sweep(df, 2, limits[, 2], FUN='>') & sweep(df, 2, limits[, 3], FUN='<')
####          x     y
#### [1,]  TRUE  TRUE
#### [2,] FALSE  TRUE
#### [3,] FALSE FALSE
#### [4,]  TRUE FALSE
#### [5,] FALSE FALSE
#### [6,] FALSE  TRUE

The TRUE locations tell you which observations to keep for each variable. It should work for any number of variables

After that if you need the global flag (at least flag in one column) you can run this simple line (res being the previous output)

apply(res, 1, all)
#### [1]  TRUE FALSE FALSE FALSE FALSE FALSE

回答2:

Not very elegant, but anyway:

df <- read.table(header=T, text="  id   x       y     
   1    2     30521  
   2   -1      1835  
   3    5     25939  
   4    4   1000000 ") 
df
ranges <- read.table(header=T, text="  var   min    max   
  x       1       5  
  y       0   99999")

ranges <- ranges[match(ranges[,1], names(df)[-1]), ] # sort ranges, if necessary
matrixStats::rowAnys(
  !sapply(seq_along(df)[-1], function(x) {
    df[,x]>=ranges[x-1,2] & df[,x]<=ranges[x-1,3]
  })
) -> df$flag
df$flag
# [1] FALSE  TRUE FALSE  TRUE

回答3:

Something like that with dplyr:

library(dplyr)
df <- read.table(text = "  id   x       y     
           1    2     30521  
           2   -1      1835  
           3    5     25939  
           4    4   1000000  ", header = TRUE)


dfilte <- read.table(text = "  var   min    max
  x       1       5  
  y       0   99999  ", header = TRUE)


df  %>% mutate(flag_x = x %in% dfilte[1, -1],
               flax_y = y %in% dfilte[2, -1])

that produces this output:

  id  x       y flag_x flax_y
1  1  2   30521  FALSE  FALSE
2  2 -1    1835  FALSE  FALSE
3  3  5   25939   TRUE  FALSE
4  4  4 1000000  FALSE  FALSE

回答4:

Doesn't really understand your desired output but this would work with any range and any number of data:

> df

  id  x       y
1  1  2   30521
2  2 -1    1835
3  3  5   25939
4  4  4 1000000


#I transpose your filter data frame so its easier to work with.
> dfFilter

    x     y
min 1     0
max 5 99999

And then you can apply your filter based on the ranges in dfFilter:

#Flag original dataframe with values between the minimum x and maximum x 

   df$flag_x=ifelse(df$x > min(dfFilter$x) & df$x < max(dfFilter$x), "yes","no")


#Flag original dataframe with values between the minimum y and maximum y

   df$flag_y=ifelse(df$y > min(dfFilter$y) & df$y < max(dfFilter$y), "yes","no")

So the output look like this:

  id  x       y flag_x flag_y
1  1  2   30521    yes    yes
2  2 -1    1835     no    yes
3  3  5   25939     no    yes
4  4  4 1000000    yes    yes

Of course you can change this filters or do any mathematical operations to it so you have your desired output (like the minimum of x-2: min(dfFilter$x)-2).

Hope it works.

回答5:

I think your problem is well suited to use cut function in base R:

df$to.remove <- is.na(cut(df$x, breaks = ranges[1,][,-1])) | 
                is.na(cut(df$y, breaks = ranges[2,][,-1]))

#  id  x       y to.remove
#1  1  2   30521     FALSE
#2  2 -1    1835      TRUE
#3  3  5   25939     FALSE
#4  4  4 1000000      TRUE

is.na(...) will give you a logical vector in which values out of the specified range are TRUE. Finally, you apply the |, namely or operator, to decide which ones have to be removed.

To clean your data, you just need to do this:

df <- df[!df$to.remove,]

EDIT

I just noticed (from your comment) that your data frame contains more variables than just x and y. In which case, you can define a function, named f, and do the following for as many variables as you have in your data frame.

f <- function(x, xrange, y, yrange) {
(is.na(cut(x, breaks = xrange)) | is.na(cut(y, breaks = yrange)))}

res <- f(df$x, ranges[1,][-1], df$y, ranges[2,][-1])

data

df <- structure(list(id = 1:4, x = c(2L, -1L, 5L, 4L), y = c(30521L, 
1835L, 25939L, 1000000L)), .Names = c("id", "x", "y"), class = "data.frame", row.names = c(NA, 
-4L))

ranges <- structure(list(var = structure(1:2, .Label = c("x", "y"), class = "factor"), 
    min = c(1L, 0L), max = c(5L, 99999L)), .Names = c("var", 
"min", "max"), class = "data.frame", row.names = c(NA, -2L))

来源：https://stackoverflow.com/questions/39976775/how-to-remove-records-from-dataframe-that-fall-outside-variable-specific-ranges

标签

outliers