Question
I hope I haven't missed it, but I haven't been able to find a working solution to this problem. I have a set of data frames with a shared column. These columns contain multiple and varying transcription errors for multiple values; some errors are shared across data frames, others are not. I would like to replace/recode the transcription errors (bad_values) with the correct values (good_values) across all data frames.
I have tried nesting the map*() family of functions across lists of data frames, bad_values, and good_values to do this, among other things. Here is an example:
df1 = data.frame(grp = c("a1","a.","a.",rep("b",7)), measure = rnorm(10))
df2 = data.frame(grp = c(rep("as", 3), "b2",rep("a",22)), measure = rnorm(26))
df3 = data.frame(grp = c(rep("b-",3),rep("bq",2),"a", rep("a.", 3)), measure = 1:9)
df_list = list(df1, df2, df3)
bad_values = list(c("a1","a.","as"), c("b2","b-","bq"))
good_values = list("a", "b")
dfs = map(df_list, function(x) {
x %>% mutate(grp = plyr::mapvalues(grp, bad_values, rep(good_values,length(bad_values))))
})
I didn't necessarily expect this to work beyond a single good-bad value pair. However, I thought nesting another call to map*() within it might work:
dfs = map(df_list, function(x) {
x %>% mutate(grp = map2(bad_values, good_values, function(x,y) {
recode(grp, bad_values = good_values)})
})
I have tried a number of other approaches, none of which have worked.
Ultimately, I would like to go from a set of data frames with errors, as here:
[[1]]
grp measure
1 a1 0.5582253
2 a. 0.3400904
3 a. -0.2200824
4 b -0.7287385
5 b -0.2128275
6 b 1.9030766
[[2]]
grp measure
1 as 1.6148772
2 as 0.1090853
3 as -1.3714180
4 b2 -0.1606979
5 a 1.1726395
6 a -0.3201150
[[3]]
grp measure
1 b- 1
2 b- 2
3 b- 3
4 bq 4
5 bq 5
6 a 6
To a list of 'fixed' data frames, as such:
[[1]]
grp measure
1 a -0.7671052
2 a 0.1781247
3 a -0.7565773
4 b -0.3606900
5 b 1.9264804
6 b 0.9506608
[[2]]
grp measure
1 a 1.45036125
2 a -2.16715639
3 a 0.80105611
4 b 0.24216723
5 a 1.33089426
6 a -0.08388404
[[3]]
grp measure
1 b 1
2 b 2
3 b 3
4 b 4
5 b 5
6 a 6
Any help would be very much appreciated.
Answer 1:
Here is an option using tidyverse with recode_factor. When there are multiple elements to be changed, create a list of key/val elements and use recode_factor to match the old values and change them to the new levels.
library(tidyverse)
keyval <- setNames(rep(good_values, lengths(bad_values)), unlist(bad_values))
out <- map(df_list, ~ .x %>%
mutate(grp = recode_factor(grp, !!! keyval)))
-output
out
#[[1]]
# grp measure
#1 a -1.63295876
#2 a 0.03859976
#3 a -0.46541610
#4 b -0.72356671
#5 b -1.11552841
#6 b 0.99352861
#....
#[[2]]
# grp measure
#1 a 1.26536789
#2 a -0.48189740
#3 a 0.23041056
#4 b -1.01324689
#5 a -1.41586086
#6 a 0.59026463
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#6 a 6
#....
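NOTE: recode_factor always returns a factor, so this leaves the class of the column unchanged only when grp was already a factor (the data.frame() default before R 4.0.0). If grp is a character column, use recode() instead to keep it character.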
str(out)
#List of 3
# $ :'data.frame': 10 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2 2 2
# ..$ measure: num [1:10] -1.633 0.0386 -0.4654 -0.7236 -1.1155 ...
# $ :'data.frame': 26 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 1 1 1 1 1 1 ...
# ..$ measure: num [1:26] 1.265 -0.482 0.23 -1.013 -1.416 ...
# $ :'data.frame': 9 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 2 2 2 2 2 1 1 1 1
# ..$ measure: int [1:9] 1 2 3 4 5 6 7 8 9
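To make the lookup concrete, here is a minimal sketch (not part of the original answer) that builds the same keyval list and applies it with dplyr's recode() to a plain character vector. The vector grp and the name fixed are made up for illustration; the point is that values without an entry in keyval pass through unchanged and the result stays character.

```r
library(dplyr)

bad_values  <- list(c("a1", "a.", "as"), c("b2", "b-", "bq"))
good_values <- list("a", "b")

# One entry per bad value: the name is the bad value, the element is its replacement.
keyval <- setNames(rep(good_values, lengths(bad_values)), unlist(bad_values))
str(keyval)

# Illustrative input (an assumption, not taken from the question's data frames).
grp <- c("a1", "a.", "b", "bq", "a")

# recode() keeps a character vector as character; unmatched values are left as-is.
fixed <- recode(grp, !!!keyval)
fixed
# "a" "a" "b" "b" "a"
```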
Once we have a keyval pair list, it can also be used with base R functions
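lookup <- c(unlist(keyval), setNames(nm = unlist(good_values)))  # add identity entries so already-correct values are kept
out1 <- lapply(df_list, function(x) transform(x, grp = unname(lookup[as.character(grp)])))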
Answer 2:
Any reason mapping a case_when statement wouldn't work?
library(tidyverse)
df_list %>%
map(~ mutate_if(.x, is.factor, as.character)) %>% # convert factor to character
map(~ mutate(.x, grp = case_when(grp %in% bad_values[[1]] ~ good_values[[1]],
grp %in% bad_values[[2]] ~ good_values[[2]],
TRUE ~ grp)))
I could see it working for your reprex but possibly not the greater problem.
Answer 3:
A base R option if you have a lot of good_values and bad_values and it is not possible to check each one individually.
lapply(df_list, function(x) {
vec = x[['grp']]
mapply(function(p, q) vec[vec %in% p] <<- q ,bad_values, good_values)
transform(x, grp = vec)
})
#[[1]]
# grp measure
#1 a -0.648146527
#2 a -0.004722549
#3 a -0.943451194
#4 b -0.709509396
#5 b -0.719434286
#....
#[[2]]
# grp measure
#1 a 1.03131291
#2 a -0.85558910
#3 a -0.05933911
#4 b 0.67812934
#5 a 3.23854093
#6 a 1.31688645
#7 a 1.87464048
#8 a 0.90100179
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#....
Here, for every list element, we extract its grp column, replace any bad_values with the corresponding good_values wherever they are found, and return the corrected data frame.
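The same idea can be sketched without the `<<-` assignment by looping over the bad/good pairs inside a small helper. This is only an illustrative variant: recode_groups() is a hypothetical name, not something from the answer above, and the two small data frames are made-up inputs. The helper converts grp to character first, on the assumption that a factor column would otherwise warn about invalid levels.

```r
# Hypothetical helper (not from the original answer): replace each set of
# bad values with its matching good value, one pair at a time.
recode_groups <- function(vec, bad, good) {
  vec <- as.character(vec)  # work on characters so new values are always valid
  for (i in seq_along(bad)) {
    vec[vec %in% bad[[i]]] <- good[[i]]
  }
  vec
}

bad_values  <- list(c("a1", "a.", "as"), c("b2", "b-", "bq"))
good_values <- list("a", "b")

# Small illustrative inputs (assumptions, not the question's data).
df_list <- list(
  data.frame(grp = c("a1", "a.", "b"), measure = 1:3),
  data.frame(grp = c("b-", "bq", "a"), measure = 4:6)
)

# Apply the helper to the shared column of every data frame.
fixed <- lapply(df_list, function(x) {
  x$grp <- recode_groups(x$grp, bad_values, good_values)
  x
})
fixed
```

Keeping the replacement in a plain function makes it easy to test on a single vector before mapping it over the whole list.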
Source: https://stackoverflow.com/questions/54017250/recode-replace-multiple-values-in-a-shared-data-column-to-a-single-value-across