Separating a column in R using Regex & separate (tidyr)

Question

This is what I want to achieve:
https://regex101.com/r/KchccA/1

I want to match any characters between = and ), including the case where the captured group is empty, because every field must be populated in every row.

Example of a row: here ADDRESS4, COUNTY, and CONTACT_NAME are empty, and some fields hold incorrect values. There is also leading and trailing text that I need to remove.

x <- "Please enter an UT location before booking the order.. ADDRESS_VALIDATION_FAILED (SITE_TYPE=uct) (SITE_USE_ID=1000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=2000) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=) The task is raised for line_number: 7"

However, when I try to use tidyr's separate function in R, I get undesirable results. Am I not escaping the pattern correctly?

Here was my code for it:

df.sub <- separate(
  data = main.data,
  col = Order.Task.Text.CCW,
  into = c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER",
           "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE",
           "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME"),
  sep = "=([^\\)]+|())\\)"
)

Example of Results:

   SITE_TYPE    SITE_USE_ID   CUSTOMER_NAME      CUSTOMER_NUMBER       
1  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
2  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
3  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
4  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER

Final Solution

Here is my final solution for anyone curious. It is based on the accepted answer and formatted for readability.

library(gsubfn)
library(magrittr)

p <- proto(
 pre = function(.) .$k <- 0,
 fun = function(., x) {
  if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
  if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
})

df.sub.final <- df.sub$text %>%
  sub("^[^\\(]*\\(", "(", .) %>%   # drop text before the first "("
  sub("\\)[^\\)]*$", ")", .) %>%   # drop text after the last ")"
  gsub("\n", "", .) %>%            # remove newlines already in the text
  gsub("=", ": ", .) %>%           # name=value -> name: value
  gsubfn("([\\(\\)]) *", p, .) %>% # rewrite outer parentheses via p
  textConnection %>%
  read.dcf %>%
  as.data.frame(.)

Answer 1:

There seems to be some uncertainty as to what the valid inputs are. Below are several different answers based on different assumptions. All convert the input to dcf form (i.e. name: value) and then use read.dcf.
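As a small illustration of the dcf form these answers target, the sketch below feeds a hand-written string (not the question's data) to read.dcf. The field names A and B are made up for the example. Records are separated by a blank line, and an empty value comes back as an empty string, so every field stays populated.

```r
# Hypothetical two-record input in dcf form: "name: value" lines,
# with a blank line between records. B is empty in the first record.
txt <- "A: 1\nB: \n\nA: 2\nB: y\n"

con <- textConnection(txt)
m <- read.dcf(con)   # character matrix, one row per record
close(con)

print(m)
```

The result is a character matrix, which is why the question's final solution pipes it through as.data.frame.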

1) Balanced parentheses.

Transform to dcf form (i.e. name: value).

We can handle balanced parentheses with gsubfn. First, create a proto object whose pre function initializes a counter k to zero. For each match of ( or ), the function fun receives the match, increments or decrements k, and outputs the appropriate replacement character. See the gsubfn package vignette for more info.

Now, starting from x, remove the junk at the beginning, replace = with : followed by a space, and then run gsubfn with the proto object defined above, matching ( or ) followed by optional spaces. Finally, read the transformed text using read.dcf.

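library(gsubfn)
library(magrittr)

# proto object: pre() initializes the depth counter k;
# fun() is called for every "(" or ")" that gsubfn matches.
p <- proto(
 pre = function(.) .$k <- 0,
 fun = function(., x) {
  # track nesting depth
  if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
  # drop an outer "(", turn an outer ")" into a newline, keep inner ones
  if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
})

x %>%
  sub("^.*?\\(", "(", .) %>%       # drop text before the first "("
  gsub("=", ": ", .) %>%           # name=value -> name: value
  gsubfn("([\\(\\)]) *", p, .) %>% # rewrite outer parentheses via p
  textConnection %>%
  read.dcf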
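The counting idea behind p can also be sketched in base R, without gsubfn. This is only an illustration of the depth counter, not the code used above: top_level_groups is a hypothetical helper, and the input string is made up. It assumes the parentheses are balanced.

```r
# Hypothetical helper: return the contents of each top-level "(...)" group.
top_level_groups <- function(s) {
  ch <- strsplit(s, "")[[1]]                   # one element per character
  # nesting depth after each character: +1 for "(", -1 for ")"
  depth <- cumsum((ch == "(") - (ch == ")"))
  starts <- which(ch == "(" & depth == 1)      # outer opening parentheses
  ends   <- which(ch == ")" & depth == 0)      # outer closing parentheses
  substring(s, starts + 1, ends - 1)           # text between each outer pair
}

groups <- top_level_groups("junk (A=1) (B=x (y) z) (C=) tail")
print(groups)
```

The inner (y) is kept as part of the B value because the depth is 2 when it opens, which mirrors the k == 1 and k == 0 checks in fun.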

2) Nested parentheses have no adjacent spaces.

x <- "(SITE_TYPE=Site1) (SITE_USE_ID=2000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=11111) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=)"


library(magrittr)

x %>%
  paste0(" ") %>%
  sub("^.*?\\(", "", .) %>%
  gsub(" +\\(", " ", .) %>%
  gsub("=", ": ", .) %>%
  gsub("\\) ", "\n", .) %>%
  textConnection %>%
  read.dcf

giving:

     SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER ADDRESS1 ADDRESS2
[1,] "Site1"   "2000"      "cname"       "11111"         "addy1"  "addy2" 
     ADDRESS3 ADDRESS4 CITY STATE ZIP    COUNTY COUNTRY CONTACT_NAME
[1,] "addy3"  ""       "."  ""    "0000" ""     "NO"    ""
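To see what each substitution in this approach does, the same steps can be run one at a time without the pipe. This is a sketch on a shortened, made-up input (x_short is not from the question), stopping to inspect the text before it reaches read.dcf.

```r
# Shortened hypothetical input with one empty field
x_short <- "(SITE_TYPE=Site1) (ADDRESS4=) (CITY=.)"

s <- paste0(x_short, " ")       # trailing space so the last ")" is followed by one
s <- sub("^.*?\\(", "", s)      # drop everything up to and including the first "("
s <- gsub(" +\\(", " ", s)      # drop each remaining opening "("
s <- gsub("=", ": ", s)         # name=value -> name: value
s <- gsub("\\) ", "\n", s)      # each closing ") " becomes a line break

cat(s)                          # one "name: value" line per field

con <- textConnection(s)
m2 <- read.dcf(con)
close(con)
print(m2)
```

Because the last step matches a closing parenthesis followed by a space, a value that itself contains that two-character sequence would be split, which is the assumption stated in the heading.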

3) Fixed keywords follow the outer left parentheses.

For this one, the inner parentheses can be unbalanced, but each outer left parenthesis is always followed by one of the keywords in cn.

x <- "ADDRESS_VALIDATION_FAILED (SITE_TYPE=site1) (SITE_USE_ID=200) (CUSTOMER_NAME=abc) (CUSTOMER_NUMBER=1000) (ADDRESS1=issue here (some more text) (ADDRESS2=) (ADDRESS3=) (ADDRESS4=) (CITY=city, ) (STATE=na) (ZIP=250) (COUNTY=) (COUNTRY=NA) (CONTACT_NAME=)"

cn <- c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER", 
"ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE", 
"ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME")
rx <- sprintf(".(%s)", paste(cn, collapse = "|"))

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsub(rx, "\n\\1", .) %>%
  gsub("\\) *\\n", "\n", .) %>%
  sub("\\)$", "", .) %>%
  textConnection %>%
  read.dcf

giving:

     SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
[1,] "site1"   "200"       "abc"         "1000"         
     ADDRESS1                     ADDRESS2 ADDRESS3 ADDRESS4 CITY    STATE
[1,] "issue here (some more text" ""       ""       ""       "city," "na" 
     ZIP   COUNTY COUNTRY CONTACT_NAME
[1,] "250" ""     "NA"    ""
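The keyword pattern is easier to follow on a tiny case. The sketch below uses two made-up keywords (cn_demo) and a made-up string that already has = replaced by a colon and a space. It shows the pattern that sprintf builds and the text after each substitution.

```r
# Hypothetical keyword list and input, for illustration only
cn_demo <- c("NAME", "CITY")
rx_demo <- sprintf(".(%s)", paste(cn_demo, collapse = "|"))
print(rx_demo)   # any one character followed by a keyword

s3 <- "(NAME: a (b) (CITY: c)"
s3 <- gsub(rx_demo, "\n\\1", s3)     # replace the "(" before each keyword with a newline
s3 <- gsub("\\) *\\n", "\n", s3)     # drop the ")" that closes the previous field
s3 <- sub("\\)$", "", s3)            # drop the final ")"
cat(s3)
```

The unbalanced inner parenthesis in the NAME value survives because only a character directly followed by a keyword is treated as a field boundary. The same logic means a value that happens to contain a keyword would be split there.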

Note

The input in reproducible form is:

Source: https://stackoverflow.com/questions/48155729/separating-a-column-in-r-using-regex-separate-tidyr

Tags

regex

tidyr