Coalesce columns based on pattern in R [duplicate]

Question

I have combined data sets in R, and each data set may use a different column name for the same data. I need to use a regular expression to identify the names of the columns I need to combine, and then run that list of column names through coalesce.

I know the regular expression that identifies my columns, and I know how to write the column names into the coalesce function by hand to combine them, but I do not know how to automatically coalesce the columns that a regular expression identifies.

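library(dplyr)

sample = data.frame("PIDno" = c('a', NA, NA), "PINID" = c(NA, 'b', NA), "ParcelId" = c(NA, NA, 'c'))

# Attempt: build "sample$<column>" strings for the matching column names
PID_search = paste("sample$", grep("PID|PIN|PARCEL", colnames(sample), ignore.case = TRUE, value = TRUE), sep = "")

# Manual version that works, with the column names written out by hand
sample$PID_combine = coalesce(sample$PIDno,
                              sample$PINID,
                              sample$ParcelId)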

Answer 1:

We can use the tidyverse. The selected columns are first converted to character with mutate_at, and then coalesced inside mutate, where !!! splices the selected columns into coalesce as separate arguments.

library(tidyverse)
sample %>%
    mutate_at(vars(matches("PID|PIN|Parcel")), as.character) %>% 
    mutate(new = coalesce(!!! select(., matches("PID|PIN|Parcel"))))
#    PIDno PINID ParcelId new
#1     a  <NA>     <NA>   a
#2  <NA>     b     <NA>   b
#3  <NA>  <NA>        c   c
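
For newer dplyr versions, where mutate_at is superseded, the same idea can probably be written with across(). The following is only a sketch, assuming dplyr 1.0.0 or later; the name sample_df is an assumption introduced here so the data frame does not share a name with base R's sample() function.

```r
library(dplyr)

# Same sample data as in the question (sample_df is a name chosen for this sketch).
sample_df <- data.frame(
  PIDno    = c("a", NA, NA),
  PINID    = c(NA, "b", NA),
  ParcelId = c(NA, NA, "c")
)

# across() selects the matching columns and converts them to character;
# do.call() then hands those columns to coalesce() as separate arguments.
result <- sample_df %>%
  mutate(new = do.call(coalesce, across(matches("PID|PIN|Parcel"), as.character)))

print(result)
```

Note that matches() ignores case by default, so "Parcel" also matches a column named PARCEL. Unlike the mutate_at version, this leaves the original columns in their original type and only converts the copies passed to coalesce.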

Answer 2:

Here's how I would do it.

(a) Don't get "sample$PIDno" as a string - that's pretty useless. Just get the column names as strings.
(b) We'll make a separate function that calls coalesce on all the columns in a data frame. This is nice and easy to write, and then we can...
(c) Call the coalesce_df function on the subset of columns you want to coalesce. It's easy to subset a data frame based on a vector of column names, so we've simplified the first step, and added two additional simple steps to get the result.

With your sample data, the columns are all factors with different levels (data.frame() creates factors from strings by default in R versions before 4.0.0). Factors with different levels can't be coalesced as-is, so I added an lapply(..., as.character) to convert everything to character first. If your real data isn't factor class, you can skip that step.

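library(dplyr)

# Names of the columns to combine, as plain strings
cols = grep("PID|PIN|PARCEL", colnames(sample), ignore.case = TRUE, value = TRUE)

# Pass every column of df to coalesce() as a separate argument
coalesce_df = function(df) {
  do.call(coalesce, df)
}

coalesce_df(lapply(sample[cols], as.character))
# [1] "a" "b" "c"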

If you want to make this work in a dplyr pipeline, I'd suggest something like this (or see akrun's answer for something a little more idiomatic).

sample %>%
  mutate_at(vars(one_of(cols)), as.character) %>%
  mutate(PID_combine = coalesce_df(.[cols]))
#   PIDno PINID ParcelId PID_combine
# 1     a  <NA>     <NA>           a
# 2  <NA>     b     <NA>           b
# 3  <NA>  <NA>        c           c
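
The three steps above (find the names, subset, coalesce) could also be folded into one reusable function. This is a minimal sketch rather than part of the original answer: coalesce_matching and sample_df are hypothetical names introduced here, and the pattern is written as a plain alternation.

```r
library(dplyr)

# Hypothetical helper: coalesce every column of df whose name matches pattern.
coalesce_matching <- function(df, pattern, ignore.case = TRUE) {
  cols <- grep(pattern, colnames(df), ignore.case = ignore.case, value = TRUE)
  if (length(cols) == 0) {
    stop("no column names match the pattern: ", pattern)
  }
  # Convert to character first so factor columns with different levels combine.
  do.call(coalesce, unname(lapply(df[cols], as.character)))
}

# Same sample data as in the question.
sample_df <- data.frame(
  PIDno    = c("a", NA, NA),
  PINID    = c(NA, "b", NA),
  ParcelId = c(NA, NA, "c")
)

sample_df$PID_combine <- coalesce_matching(sample_df, "PID|PIN|PARCEL")
print(sample_df)
```

Columns are coalesced in the order they appear in the data frame, so if two matching columns both hold a value in the same row, the leftmost one wins.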

Answer 3:

I might be barking up the wrong tree, but the contract of the coalesce() function is that it returns, for each position, the first non-NA value among its arguments, reading from left to right. So, if you use the following code:

sample$PID_combine = coalesce(sample$PIDno, sample$PINID, sample$ParcelId)

then, for each row, the result is PIDno if that value is not NA, otherwise PINID, and otherwise ParcelId, in that order.

The value for PID_combine would be ['a', 'b', 'c'], for the sample input data you gave in your question.
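
To make the left-to-right rule concrete, here is a small base R illustration. It is a sketch only: coalesce_base is a hypothetical stand-in written for this example, not dplyr's implementation, and the fourth element is added to show what happens when more than one column has a value.

```r
# Hypothetical base R stand-in for dplyr::coalesce(), for illustration only.
# Reduce() walks the arguments left to right, filling NAs from the next vector.
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

pid_no    <- c("a", NA, NA, "d")
pin_id    <- c(NA, "b", NA, "x")
parcel_id <- c(NA, NA, "c", "y")

combined <- coalesce_base(pid_no, pin_id, parcel_id)
print(combined)
# [1] "a" "b" "c" "d"
```

In the fourth position all three vectors hold a value, and the result keeps "d" because pid_no comes first in the argument list.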

Source: https://stackoverflow.com/questions/56776187/coalesce-columns-based-on-pattern-in-r

Tags

regex

dplyr

coalesce