Question
I recently asked the question "Separate contents of field" and got a very quick, very simple answer.
Something I can do simply in Excel is look in a cell, find the first instance of a character, and then return all the characters to the left of it.
For example:
Author
Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.
I can extract Drijgers RL and Aalten P into separate columns in Excel. This lets me count how many times someone is the first author and how many times they are the last author.
How can I replicate this in R? Using the separate-rows answer to my earlier question, I can already count the total number of publications per author.
How would I split out the first and last authors into separate columns? One more thing that might be useful to know: in the answer to "Splitting column by separator from right to left in R", the number of columns is known. How do I say "split this string at commas, and throw the pieces into an unknown number of columns, based on the number of names in the author list, to the right of the original field"?
Answer 1:
data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df
cbind.data.frame( # add the columns to the original data frame after the do.call() completes
  sample_df,
  do.call( # turn the list created with lapply below into a data frame
    rbind.data.frame,
    lapply(
      strsplit(sample_df$authors, ", "), # split at comma+space
      function(x) {
        data.frame( # pull first/last into a data frame
          first = x[1],
          last = if (length(x) < 2) NA_character_ else x[length(x)], # NA last if only one author
          stringsAsFactors = FALSE
        )
      }
    )
  )
)
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>
The above performs terribly. I made a stringi match-group extraction version, but arg0naut's is still faster. I also optimized arg0naut's a bit, since the whitespace only needs to be stripped on the left:
library(stringi)
data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df
# make some copies since we're modifying in-place now
s1 <- s2 <- sample_df
microbenchmark::microbenchmark(
  stri_regex = {
    s1$first <- stri_match_first_regex(s1$authors, "^([^,]+)")[, 2]
    s1$last <- stri_trim_left(stri_match_last_regex(s1$authors, "([^,]+)$")[, 2])
    s1$last <- ifelse(s1$last == s1$first, NA_character_, s1$last)
  },
  extract_authors = {
    s2[["first"]] <- ifelse(
      grepl(",", s2[["authors"]]), gsub(",.*", "", s2[["authors"]]), s2[["authors"]]
    )
    s2[["last"]] <- ifelse(
      grepl(",", s2[["authors"]]), trimws(gsub(".*,", "", s2[["authors"]]), "left"), NA_character_
    )
  }
)
Results:
## Unit: microseconds
##             expr     min       lq     mean   median       uq      max neval
##       stri_regex 236.948 265.8055 331.5695 291.6610 334.1685 1002.921   100
##  extract_authors 127.584 150.8490 217.1192 162.4625 227.9995 1130.913   100
identical(s1, s2)
## [1] TRUE
s1
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>
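The question also asks how to spread the names over an unknown number of columns, which the first/last extraction above does not cover. Here is a minimal base-R sketch of one way that might be done, assuming the separator is always a comma followed by optional whitespace. The object names (parts, n, mat, wide) and the author_1, author_2, ... column names are made up for illustration:

```r
# Illustrative data, same shape as the sample above
authors <- c(
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
  "Drijgers RL, Verhey FR",
  "Drijgers RL"
)

# Split every author list at a comma plus optional whitespace
parts <- strsplit(authors, ",\\s*")

# The widest row decides how many columns are needed
n <- max(lengths(parts))

# Pad each vector with NA up to n entries, then stack the rows into a matrix
mat <- do.call(rbind, lapply(parts, function(x) {
  length(x) <- n # extending a vector fills the new slots with NA
  x
}))
colnames(mat) <- paste0("author_", seq_len(n)) # hypothetical column names

# Put the new columns to the right of the original field
wide <- cbind(
  data.frame(authors = authors, stringsAsFactors = FALSE),
  as.data.frame(mat, stringsAsFactors = FALSE)
)
```

Rows with fewer names than the longest list get NA in the trailing columns, so the number of columns never has to be known up front.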
Answer 2:
Try this function:
extract_authors <- function(df, authors) {
  df[["FirstAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(",.*", "", df[[authors]])), df[[authors]]
  )
  df[["LastAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(".*,", "", df[[authors]])), "No last author"
  )
  return(df)
}
It works with the example data from the other answer in this thread:
data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df
You can call it like this, passing the data frame and the name of the author column:
extract_authors(sample_df, "authors")
In the output, you get 2 new columns, FirstAuthor and LastAuthor:
                                                    authors FirstAuthor     LastAuthor
1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL      Aalten P.
2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL       Kahler S
3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL   Leentjens AF
4                                    Drijgers RL, Verhey FR Drijgers RL      Verhey FR
5                                               Drijgers RL Drijgers RL No last author
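Since the stated goal in the question is to count how often someone appears as first or last author, here is a rough sketch of that follow-up step. It is not part of the function above. It redoes the extraction inline so it runs on its own, uses NA rather than a placeholder string for single-author rows, and assumes that a trailing period (as in "Aalten P.") should be dropped so the same person is not counted under two spellings. The names first, last, first_counts and last_counts are illustrative:

```r
authors <- c(
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
  "Drijgers RL, Verhey FR, Leentjens AF",
  "Drijgers RL, Verhey FR",
  "Drijgers RL"
)

# First author: everything before the first comma (whole string if no comma)
first <- sub(",.*", "", authors)

# Last author: everything after the last comma; NA when there is only one author
last <- ifelse(
  grepl(",", authors),
  trimws(sub(".*,", "", authors)),
  NA_character_
)

# Assumption: a trailing period is punctuation, not part of the name
first <- sub("\\.$", "", first)
last <- sub("\\.$", "", last)

# table() ignores NA by default, so single-author rows do not count as "last"
first_counts <- table(first)
last_counts <- table(last)
```

With the sample data, first_counts has a single entry (Drijgers RL, 5) and last_counts has four names with one occurrence each.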
Source: https://stackoverflow.com/questions/53318374/separate-variable-in-field-by-character