Question
I have combined data sets in R, and each data set may use a different column name for the same data. I need to use a regular expression to identify the names of the columns I need to combine, and then run that list of column names through coalesce.
I know the proper regular expression to identify my columns, and I know how to manually write the column names into the coalesce function to combine these columns, but I do not know how to automatically coalesce columns identified with a regular expression.
sample = data.frame("PIDno" = c('a', NA, NA), "PINID" = c(NA, 'b', NA), "ParcelId" = c(NA, NA, 'c'))
PID_search = paste("sample$",grep("*PID*|*PIN*|*PARCEL*",colnames(sample),ignore.case = TRUE, value = TRUE),sep = "")
sample$PID_combine = coalesce(sample$'PIDno',
                              sample$'PINID',
                              sample$'ParcelId')
Answer 1:
We can use tidyverse. The selected columns are converted to character with mutate_at, and then those columns are coalesced inside mutate.
library(tidyverse)
sample %>%
  mutate_at(vars(matches("PID|PIN|Parcel")), as.character) %>%
  mutate(new = coalesce(!!! select(., matches("PID|PIN|Parcel"))))
# PIDno PINID ParcelId new
#1 a <NA> <NA> a
#2 <NA> b <NA> b
#3 <NA> <NA> c c
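In dplyr 1.0.0 and later, mutate_at is superseded by across(). As a hedged sketch (not part of the original answer, and assuming dplyr >= 1.0.0 is installed), roughly the same result can be obtained by handing the tibble that across() returns to coalesce via do.call. The column name new mirrors the answer above, and the original columns are left unconverted here:

```r
library(dplyr)

# Sample data from the question
sample <- data.frame(
  PIDno    = c("a", NA, NA),
  PINID    = c(NA, "b", NA),
  ParcelId = c(NA, NA, "c")
)

# across() returns a tibble of the matched columns, each converted to character.
# A tibble is a list of columns, so do.call() passes them to coalesce() in order.
result <- sample %>%
  mutate(new = do.call(coalesce, across(matches("PID|PIN|Parcel"), as.character)))
```

Note that matches() ignores case by default, which is why "Parcel" also picks up ParcelId.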
Answer 2:
Here's how I would do it.
- (a) Don't get "sample$PIDno" as a string - that's pretty useless. Just get the column names as strings.
- (b) We'll make a separate function that calls coalesce on all the columns in a data frame. This is nice and easy to write, and then we can...
- (c) Call the coalesce_df function on the subset of columns you want to coalesce. It's easy to subset a data frame based on a vector of column names, so we've simplified the first step, and added two additional simple steps to get the result.
With your sample data, the columns are all factors with different levels (this is the case in R versions before 4.0, where data.frame() converts strings to factors by default). Those can't be coalesced as-is, so I added an lapply(..., as.character) to convert everything to character first. If your real data isn't factor class, then you can skip that step.
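library(dplyr)  # provides coalesce()

# Names of the columns matching any of the patterns (case-insensitive).
# Plain alternation is enough; the glob-style * characters are not needed in a regex.
cols = grep("PID|PIN|PARCEL", colnames(sample), ignore.case = TRUE, value = TRUE)
coalesce_df = function(df) {
  do.call(coalesce, df)
}
coalesce_df(lapply(sample[cols], as.character))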
# [1] "a" "b" "c"
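If dplyr is not available, the same idea can be sketched in base R. This is an alternative under stated assumptions rather than part of the answer: coalesce2 is a hypothetical helper that mimics a two-argument coalesce, and Reduce folds it over the selected columns from left to right.

```r
# Sample data from the question
sample <- data.frame(
  PIDno    = c("a", NA, NA),
  PINID    = c(NA, "b", NA),
  ParcelId = c(NA, NA, "c")
)

# Column names matching any of the patterns, ignoring case
cols <- grep("PID|PIN|PARCEL", colnames(sample), ignore.case = TRUE, value = TRUE)

# Hypothetical helper: keep x where it is present, otherwise fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over the columns, converting each to character first
combined <- Reduce(coalesce2, lapply(sample[cols], as.character))
combined
# [1] "a" "b" "c"
```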
If you want to make this work in a dplyr pipeline, I'd suggest something like this (or see akrun's answer for something a little more idiomatic).
sample %>%
  mutate_at(vars(one_of(cols)), as.character) %>%
  mutate(PID_combine = coalesce_df(.[cols]))
# PIDno PINID ParcelId PID_combine
# 1 a <NA> <NA> a
# 2 <NA> b <NA> b
# 3 <NA> <NA> c c
Answer 3:
I might be barking up the wrong tree, but the contract of the coalesce() function is that it returns the first non-NA value in the parameter list, from left to right. So, if you use the following code:
sample$PID_combine = coalesce(sample$PIDno, sample$PINID, sample$ParcelId)
then, for each row, it returns PIDno if that value is non-NA, otherwise PINID, and finally ParcelId, in that order.
The value of PID_combine would be ['a', 'b', 'c'] for the sample input data you gave in your question.
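To make the left-to-right contract concrete, here is a small sketch with made-up vectors (first, second, and third are hypothetical and not taken from the question) in which several arguments are non-NA at the same position. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical vectors that overlap in their non-NA positions
first  <- c("a", NA, NA)
second <- c("x", "b", NA)
third  <- c("y", "z", "c")

# The leftmost non-NA value wins at each position
combined <- coalesce(first, second, third)
combined
# [1] "a" "b" "c"

# Reversing the argument order changes which values are picked
reversed <- coalesce(third, second, first)
reversed
# [1] "y" "z" "c"
```

So when more than one matching column can hold a value in the same row, the order in which the columns are passed decides the result.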
Source: https://stackoverflow.com/questions/56776187/coalesce-columns-based-on-pattern-in-r