Separate variable in field by character

Question

I recently asked the question "Separate contents of field" and got a very quick and very simple answer.

Something I can do simply in Excel is look in a cell, find the first instance of a character and then return all the characters to the left of that.

For example

Author

Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

I can extract Drijgers RL and Aalten P into separate columns in Excel. This lets me count the number of times someone is a first author and the number of times they are a last author.

How can I replicate this in R? I can count the total number of times an author has a publication from the separate rows answer above.

How would I split out the first and last authors into separate columns? A more general version might also be useful to know. In the answer to "Splitting column by separator from right to left in R", the number of columns is known. How do I say "split this string at commas, and put the pieces into an unknown number of columns, based on the number of names in the author list, to the right of the original field"?

Answer 1:

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

cbind.data.frame( # add the columns to the original data frame after the do.call() completes
  sample_df,
  do.call( # turn the list created with lapply below into a data frame
    rbind.data.frame, 
    lapply(
      strsplit(sample_df$authors, ", "), # split at comma+space
      function(x) {
        data.frame( # pull first/last into a data frame
          first = x[1],
          last = if (length(x) < 2) NA_character_ else x[length(x)], # NA last if only one author
          stringsAsFactors = FALSE
        )
      }
    )
  )
)
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>

The above performs terribly. I wrote a stringi version based on match group extraction, but arg0naut's solution is still faster. I also optimized arg0naut's version a bit, since whitespace only needs to be stripped on the left:

library(stringi)

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

# make some copies since we're modifying in-place now
s1 <- s2 <- sample_df

microbenchmark::microbenchmark(

  stri_regex = {
    s1$first <-  stri_match_first_regex(s1$authors, "^([^,]+)")[,2]
    s1$last <- stri_trim_left(stri_match_last_regex(s1$authors, "([^,]+)$")[,2])
    s1$last <- ifelse(s1$last == s1$first, NA_character_, s1$last)
  },

  extract_authors = {
    s2[["first"]] <- ifelse(
      grepl(",", s2[["authors"]]), gsub(",.*", "", s2[["authors"]]), s2[["authors"]]
    )
    s2[["last"]] <- ifelse(
      grepl(",", s2[["authors"]]), trimws(gsub(".*,", "", s2[["authors"]]), "left"), NA_character_
    )

  }

)

Results:

## Unit: microseconds
##             expr     min       lq     mean   median       uq      max neval
##       stri_regex 236.948 265.8055 331.5695 291.6610 334.1685 1002.921   100
##  extract_authors 127.584 150.8490 217.1192 162.4625 227.9995 1130.913   100

identical(s1, s2)
## [1] TRUE

s1
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>
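Once the first and last authors are in their own columns, the counts the question asks for can be sketched with table(). This is only a minimal illustration: the pubs data frame is a made-up example, and stripping a trailing period (so that "Aalten P." and "Aalten P" count as the same name) is an assumption about how the names should be normalized.

```r
# Hypothetical sample data, for illustration only.
pubs <- data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Verhey FR, Aalten P"
  ),
  stringsAsFactors = FALSE
)

# Split once at comma (plus optional whitespace), then take the
# first and last element of each author list.
parts <- strsplit(pubs$authors, ",\\s*")
first <- vapply(parts, function(x) x[1], character(1))
last <- vapply(
  parts,
  function(x) if (length(x) < 2) NA_character_ else x[length(x)],
  character(1)
)

# Assumption: a trailing period should not create a separate name.
last <- sub("\\.$", "", last)

# table() drops NA values by default, so single-author rows
# do not contribute to the last-author counts.
first_counts <- table(first)
last_counts <- table(last)

first_counts
last_counts
```

Real author lists may need more normalization than this (initials with or without periods, inconsistent spacing), so the counts are only as good as the name cleaning.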

Answer 2:

Try this function:

extract_authors <- function(df, authors) {

  df[["FirstAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(",.*", "", df[[authors]])), df[[authors]]
  )


  df[["LastAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(".*,", "", df[[authors]])), "No last author"
  )

  return(df)

}

It works with the sample data from the other answer in this topic:

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

You can call it like:

extract_authors(sample_df, "authors")

In the output, you get 2 new columns, FirstAuthor and LastAuthor:

                                                    authors FirstAuthor     LastAuthor
1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL      Aalten P.
2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL       Kahler S
3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL   Leentjens AF
4                                    Drijgers RL, Verhey FR Drijgers RL      Verhey FR
5                                               Drijgers RL Drijgers RL No last author
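The question also asks how to split the author list into an unknown number of columns. One possible base R sketch is to split each string, pad the shorter lists with NA up to the longest one, and bind the rows together. The column names author_1, author_2, ... are hypothetical and chosen only for this illustration.

```r
authors <- c(
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
  "Drijgers RL, Verhey FR",
  "Drijgers RL"
)

# Split at comma plus optional whitespace.
parts <- strsplit(authors, ",\\s*")

# The number of columns is taken from the longest author list.
n_max <- max(lengths(parts))

# Pad each author list with NA up to n_max, then bind the rows
# into a character matrix.
padded <- lapply(parts, function(x) c(x, rep(NA_character_, n_max - length(x))))
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)

# Hypothetical column names: author_1 ... author_n.
names(wide) <- paste0("author_", seq_len(n_max))

# Put the new columns to the right of the original field.
result <- data.frame(authors = authors, wide, stringsAsFactors = FALSE)
result
```

With the three sample strings this gives one authors column plus five author columns, with NA wherever a row has fewer names than the longest list.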

Source: https://stackoverflow.com/questions/53318374/separate-variable-in-field-by-character

Tags

regex

tidyr