问题
I am trying to achieve something similar to this question but with multiple values that must be replaced by NA, and in large dataset.
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = rep(1:9), var2 = rep(3:5, each = 3))
which generates this dataframe:
df
name foo var1 var2
1 a 1 1 3
2 a 2 2 3
3 a 3 3 3
4 b 4 4 4
5 b 5 5 4
6 b 6 6 4
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
I would like to replace all occurrences of, say, 3 and 4 by NA, but only in the columns that start with "var".
I know that I can use a combination of [] operators to achieve the result I want:
df[,grep("^var[:alnum:]?",colnames(df))][
df[,grep("^var[:alnum:]?",colnames(df))] == 3 |
df[,grep("^var[:alnum:]?",colnames(df))] == 4
] <- NA
df
name foo var1 var2
1 a 1 1 NA
2 a 2 2 NA
3 a 3 NA NA
4 b 4 NA NA
5 b 5 5 NA
6 b 6 6 NA
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
Now my questions are the following:
- Is there a way to do this in an efficient way, given that my actual dataset has about 100.000 lines, and 400 out of 500 variables start with "var". It seems (subjectively) slow on my computer when I use the double brackets technique.
- How would I approach the problem if
instead of 2 values (3 and 4) to be replaced by NA, I had a long
list of, say, 100 various values? Is there a way to specify multiple values with having to do a clumsy series of conditions separated by
|operator?
回答1:
You can also do this using replace:
sel <- grepl("var",names(df))
df[sel] <- lapply(df[sel], function(x) replace(x,x %in% 3:4, NA) )
df
# name foo var1 var2
#1 a 1 1 NA
#2 a 2 2 NA
#3 a 3 NA NA
#4 b 4 NA NA
#5 b 5 5 NA
#6 b 6 6 NA
#7 c 7 7 5
#8 c 8 8 5
#9 c 9 9 5
Some quick benchmarking using a million row sample of data suggests this is quicker than the other answers.
回答2:
You could also do:
col_idx <- grep("^var", names(df))
values <- c(3, 4)
m1 <- as.matrix(df[,col_idx])
m1[m1 %in% values] <- NA
df[col_idx] <- m1
df
# name foo var1 var2
#1 a 1 1 NA
#2 a 2 2 NA
#3 a 3 NA NA
#4 b 4 NA NA
#5 b 5 5 NA
#6 b 6 6 NA
#7 c 7 7 5
#8 c 8 8 5
#9 c 9 9 5
回答3:
I haven't timed this option, but I have written a function called makemeNA that is part of my GitHub-only "SOfun" package.
With that function, the approach would be something like this:
library(SOfun)
Cols <- grep("^var", names(df))
df[Cols] <- makemeNA(df[Cols], NAStrings = as.character(c(3, 4)))
df
# name foo var1 var2
# 1 a 1 1 NA
# 2 a 2 2 NA
# 3 a 3 NA NA
# 4 b 4 NA NA
# 5 b 5 5 NA
# 6 b 6 6 NA
# 7 c 7 7 5
# 8 c 8 8 5
# 9 c 9 9 5
The function uses the na.strings argument in type.convert to do the conversion to NA.
Install the package with:
library(devtools)
install_github("SOfun", "mrdwab")
(or your favorite method of installing packages from GitHub).
Here's some benchmarking. I've decided to make things interesting and replace both numeric and non-numeric values with NA to see how things compare.
Here's the sample data:
n <- 1000000
set.seed(1)
df <- data.frame(
name1 = sample(letters[1:3], n, TRUE),
name2 = sample(letters[1:3], n, TRUE),
name3 = sample(letters[1:3], n, TRUE),
var1 = sample(9, n, TRUE),
var2 = sample(5, n, TRUE),
var3 = sample(9, n, TRUE))
Here are the functions to test:
fun1 <- function() {
Cols <- names(df)
df[Cols] <- makemeNA(df[Cols], NAStrings = as.character(c(3, 4, "a")))
df
}
fun2 <- function() {
values <- c(3, 4, "a")
col_idx <- names(df)
m1 <- as.matrix(df)
m1[m1 %in% values] <- NA
df[col_idx] <- m1
df
}
fun3 <- function() {
values <- c(3, 4, "a")
col_idx <- names(df)
val_idx <- sapply(df[col_idx], "%in%", table = values)
is.na(df[col_idx]) <- val_idx
df
}
fun4 <- function() {
sel <- names(df)
df[sel] <- lapply(df[sel], function(x)
replace(x, x %in% c(3, 4, "a"), NA))
df
}
I'm breaking out fun2 and fun3. I'm not crazy about fun2 because it converts everything to the same type. I also expect fun3 to be slower.
system.time(fun2())
# user system elapsed
# 4.45 0.33 4.81
system.time(fun3())
# user system elapsed
# 34.31 0.38 34.74
So now it comes down to me and Thela...
library(microbenchmark)
microbenchmark(fun1(), fun4(), times = 50)
# Unit: seconds
# expr min lq median uq max neval
# fun1() 2.934278 2.982292 3.070784 3.091579 3.617902 50
# fun4() 2.839901 2.964274 2.981248 3.128327 3.930542 50
Dang you Thela!
回答4:
Here's an approach:
# the values that should be replaced by NA
values <- c(3, 4)
# index of columns
col_idx <- grep("^var", names(df))
# [1] 3 4
# index of values (within these columns)
val_idx <- sapply(df[col_idx], "%in%", table = values)
# var1 var2
# [1,] FALSE TRUE
# [2,] FALSE TRUE
# [3,] TRUE TRUE
# [4,] TRUE TRUE
# [5,] FALSE TRUE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
# replace with NA
is.na(df[col_idx]) <- val_idx
df
# name foo var1 var2
# 1 a 1 1 NA
# 2 a 2 2 NA
# 3 a 3 NA NA
# 4 b 4 NA NA
# 5 b 5 5 NA
# 6 b 6 6 NA
# 7 c 7 7 5
# 8 c 8 8 5
# 9 c 9 9 5
回答5:
I think dplyr is very well-suited for this task.
Using replace() as suggested by @thelatemail, you could do something like this:
library("dplyr")
df <- df %>%
mutate_at(vars(starts_with("var")),
funs(replace(., . %in% c(3, 4), NA)))
df
# name foo var1 var2
# 1 a 1 1 NA
# 2 a 2 2 NA
# 3 a 3 NA NA
# 4 b 4 NA NA
# 5 b 5 5 NA
# 6 b 6 6 NA
# 7 c 7 7 5
# 8 c 8 8 5
# 9 c 9 9 5
回答6:
Here is a dplyr solution:
# Define replace function
repl.f <- function(x) ifelse(x%in%c(3,4), NA,x)
library(dplyr)
cbind(select(df, -starts_with("var")),
mutate_each(select(df, starts_with("var")), funs(repl.f)))
name foo var1 var2
1 a 1 1 NA
2 a 2 2 NA
3 a 3 NA NA
4 b 4 NA NA
5 b 5 5 NA
6 b 6 6 NA
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
来源:https://stackoverflow.com/questions/25768305/r-replace-multiple-values-in-multiple-columns-of-dataframes-with-na