Question
I have a dataframe that looks something like this:
Row  ID1  ID2  Colors1              Colors2
1    1    2    Green, Blue, Purple  Green, Purple
2    1    3    Green, Orange        Orange, Red
I would like to create a calculation that tells me the count of colors in common between Colors1 and Colors2. The desired result is the following:
Row  ID1  ID2  Colors1              Colors2        Common
1    1    2    Green, Blue, Purple  Green, Purple  2  # Green, Purple
2    1    3    Green, Orange        Orange, Red    1  # Orange
Answer 1:
You can use:
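# as.character() guards against factor columns, which strsplit() rejects
col1 <- strsplit(as.character(df$Colors1), ", ")
col2 <- strsplit(as.character(df$Colors2), ", ")
# For each row, count the colors present in both split vectors
df$Common <- sapply(seq_len(nrow(df)), function(x) length(intersect(col1[[x]], col2[[x]])))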
Example
df <- data.frame(Colors1 = c('Green, Blue', 'Green, Blue, Purple'), Colors2 = c('Green, Purple', 'Orange, Red'), stringsAsFactors = FALSE)
col1 <- strsplit(df$Colors1, ", ")
col2 <- strsplit(df$Colors2, ", ")
df$Common <- sapply(seq_len(nrow(df)), function(x) length(intersect(col1[[x]], col2[[x]])))
df
#               Colors1       Colors2 Common
# 1         Green, Blue Green, Purple      1
# 2 Green, Blue, Purple   Orange, Red      0
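The question also lists the shared colors themselves next to the count. A minimal sketch of that, using the same split-and-intersect idea but pairing the rows with mapply() so the intersections are kept; the data frame below and the CommonColors column name are illustrative assumptions, not part of the original answer:

```r
# Hypothetical data stored as factors, mirroring the rows in the question
df <- data.frame(
  Colors1 = c("Green, Blue, Purple", "Green, Orange"),
  Colors2 = c("Green, Purple", "Orange, Red"),
  stringsAsFactors = TRUE
)

# as.character() is needed because strsplit() rejects factor input
col1 <- strsplit(as.character(df$Colors1), ", ", fixed = TRUE)
col2 <- strsplit(as.character(df$Colors2), ", ", fixed = TRUE)

# Intersect row by row; SIMPLIFY = FALSE keeps one character vector per row
shared <- mapply(intersect, col1, col2, SIMPLIFY = FALSE)

df$Common <- lengths(shared)                                              # number of shared colors
df$CommonColors <- vapply(shared, paste, character(1), collapse = ", ")  # the shared colors (assumed column name)
df
```

Keeping the intersections in a list means the count and the labels come from the same computation, so they cannot disagree.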
Answer 2:
An alternative approach is to treat the first column as a regular expression to search in the second column and make use of the "stringi" package to facilitate the vectorized searching of the patterns.
df <- structure(list(Colors1 = c("Green, Blue, Purple", "Green, Blue",
"Green, Blue, Purple"), Colors2 = c("Green, Purple", "Green, Purple",
"Orange, Red")), .Names = c("Colors1", "Colors2"), row.names = c("2",
"21", "3"), class = "data.frame")
df
#                Colors1       Colors2
# 2  Green, Blue, Purple Green, Purple
# 21         Green, Blue Green, Purple
# 3  Green, Blue, Purple   Orange, Red
library(stringi)
stri_extract_all_regex(df$Colors2, gsub(", ", "|", df$Colors1))
# [[1]]
# [1] "Green" "Purple"
#
# [[2]]
# [1] "Green"
#
# [[3]]
# [1] NA
stri_count_regex(df$Colors2, gsub(", ", "|", df$Colors1))
# [1] 2 1 0
Basically, what I've done there is use gsub to convert the "Colors1" column to a regular expression search pattern that looks like "Green|Blue|Purple" instead of "Green, Blue, Purple", and used that as the search pattern in each of the "stringi" functions I demonstrated above.
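One thing to keep in mind with the pattern approach is that a regular expression matches substrings, so a color such as "Red" would also be counted inside a longer name like "DarkRed". A minimal sketch of one way around that, assuming color names consist only of word characters: wrap the alternation in \\b word boundaries. It is written in base R so that it runs without extra packages; the data and the count_matches() helper are hypothetical.

```r
# Hypothetical data: "Red" is a substring of "DarkRed"
colors1 <- c("Red, Blue", "Green, Blue")
colors2 <- c("DarkRed, Blue", "Green, Purple")

# Plain alternation as in the answer above, e.g. "Red|Blue"
plain <- gsub(", ", "|", colors1, fixed = TRUE)
# Same alternation anchored on word boundaries, e.g. "\\b(Red|Blue)\\b"
bounded <- paste0("\\b(", plain, ")\\b")

# gregexpr() is not vectorised over the pattern, so rows are paired with mapply()
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text, perl = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when nothing matches
}

naive  <- unname(mapply(count_matches, plain, colors2))    # counts "Red" inside "DarkRed"
common <- unname(mapply(count_matches, bounded, colors2))  # whole names only
naive
common
```

With this data the plain pattern counts two matches in the first row, while the bounded pattern counts one.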
Source: https://stackoverflow.com/questions/22725023/number-of-matches-between-two-comma-separated-factors-in-a-data-frame