问题
I have a dataframe and a predictive model that I want to apply to the data. However, I want to filter out records for which the model might not apply very well. To do this, I have another dataframe that contains for every variable the minimum and maximum observed in the training data. I want to remove those records from my new data for which one or more values fall outside the specified range.
To make my question clear, this is what my data might look like:
id x y
---- ---- ---------
1 2 30521
2 -1 1835
3 5 25939
4 4 1000000
This is what my second table, with the mins and maxes, could look like:
var min max
----- ----- -------
x 1 5
y 0 99999
In this example, I would want to flag the following records in my data: 2 (lower than the minimum for x) and 4 (higher than the max for y).
How could I easily do this in R? I have a hunch there's some clever dplyr
code that would accomplish this task, but I wouldn't know what it would look like.
回答1:
You have your data like this:
df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))
You can use the sweep
function with operator '>'
and '<'
it's quite straightforward!
sweep(df, 2, limits[, 2], FUN='>') & sweep(df, 2, limits[, 3], FUN='<')
#### x y
#### [1,] TRUE TRUE
#### [2,] FALSE TRUE
#### [3,] FALSE FALSE
#### [4,] TRUE FALSE
#### [5,] FALSE FALSE
#### [6,] FALSE TRUE
The TRUE locations tell you which observations to keep for each variable. It should work for any number of variables
After that if you need the global flag (at least flag in one column) you can run this simple line (res being the previous output)
apply(res, 1, all)
#### [1] TRUE FALSE FALSE FALSE FALSE FALSE
回答2:
Not very elegant, but anyway:
df <- read.table(header=T, text=" id x y
1 2 30521
2 -1 1835
3 5 25939
4 4 1000000 ")
df
ranges <- read.table(header=T, text=" var min max
x 1 5
y 0 99999")
ranges <- ranges[match(ranges[,1], names(df)[-1]), ] # sort ranges, if necessary
matrixStats::rowAnys(
!sapply(seq_along(df)[-1], function(x) {
df[,x]>=ranges[x-1,2] & df[,x]<=ranges[x-1,3]
})
) -> df$flag
df$flag
# [1] FALSE TRUE FALSE TRUE
回答3:
Something like that with dplyr:
library(dplyr)
df <- read.table(text = " id x y
1 2 30521
2 -1 1835
3 5 25939
4 4 1000000 ", header = TRUE)
dfilte <- read.table(text = " var min max
x 1 5
y 0 99999 ", header = TRUE)
df %>% mutate(flag_x = x %in% dfilte[1, -1],
flax_y = y %in% dfilte[2, -1])
that produces this output:
id x y flag_x flax_y
1 1 2 30521 FALSE FALSE
2 2 -1 1835 FALSE FALSE
3 3 5 25939 TRUE FALSE
4 4 4 1000000 FALSE FALSE
回答4:
Doesn't really understand your desired output but this would work with any range and any number of data:
> df
id x y
1 1 2 30521
2 2 -1 1835
3 3 5 25939
4 4 4 1000000
#I transpose your filter data frame so its easier to work with.
> dfFilter
x y
min 1 0
max 5 99999
And then you can apply your filter based on the ranges in dfFilter
:
#Flag original dataframe with values between the minimum x and maximum x
df$flag_x=ifelse(df$x > min(dfFilter$x) & df$x < max(dfFilter$x), "yes","no")
#Flag original dataframe with values between the minimum y and maximum y
df$flag_y=ifelse(df$y > min(dfFilter$y) & df$y < max(dfFilter$y), "yes","no")
So the output look like this:
id x y flag_x flag_y
1 1 2 30521 yes yes
2 2 -1 1835 no yes
3 3 5 25939 no yes
4 4 4 1000000 yes yes
Of course you can change this filters or do any mathematical operations to it so you have your desired output (like the minimum of x-2: min(dfFilter$x)-2
).
Hope it works.
回答5:
I think your problem is well suited to use cut
function in base R:
df$to.remove <- is.na(cut(df$x, breaks = ranges[1,][,-1])) |
is.na(cut(df$y, breaks = ranges[2,][,-1]))
# id x y to.remove
#1 1 2 30521 FALSE
#2 2 -1 1835 TRUE
#3 3 5 25939 FALSE
#4 4 4 1000000 TRUE
is.na(...)
will give you a logical vector in which values out of the specified range are TRUE
. Finally, you apply the |
, namely or
operator, to decide which ones have to be removed.
To clean your data, you just need to do this:
df <- df[!df$to.remove,]
EDIT
I just noticed (from your comment) that your data frame contains more variables than just x
and y
. In which case, you can define a function, named f
, and do the following for as many variables as you have in your data frame.
f <- function(x, xrange, y, yrange) {
(is.na(cut(x, breaks = xrange)) | is.na(cut(y, breaks = yrange)))}
res <- f(df$x, ranges[1,][-1], df$y, ranges[2,][-1])
data
df <- structure(list(id = 1:4, x = c(2L, -1L, 5L, 4L), y = c(30521L,
1835L, 25939L, 1000000L)), .Names = c("id", "x", "y"), class = "data.frame", row.names = c(NA,
-4L))
ranges <- structure(list(var = structure(1:2, .Label = c("x", "y"), class = "factor"),
min = c(1L, 0L), max = c(5L, 99999L)), .Names = c("var",
"min", "max"), class = "data.frame", row.names = c(NA, -2L))
来源:https://stackoverflow.com/questions/39976775/how-to-remove-records-from-dataframe-that-fall-outside-variable-specific-ranges