问题
This is the dataset:
company <- c("Coca-Cola Inc.", "DF, CocaCola",
"COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero","N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456","1653", "19", "2766", "167")
data <-data.frame(company, brand, vol)
data
Which results in:
company brand vol
1 Coca-Cola Inc. Coca-Cola Zero 2456
2 DF, CocaCola N/A 1653
3 COCA-COLA CocaCola 19
4 PepsiCo Inc. Pepsi 2766
5 Beverages Distribution soft drink 167
Let's say, this is imported volume by brand.
The task is to SUBSET the dataframe to only see observations related to Coca-Cola, not any other brand.
- The problem is that Coca-Cola is written in many different ways.
- Also, we know that Beverages Distribution company only imports Coca-cola, even though it is not indicated in the table above.
We need to partially match COMPANY and BRAND variables against a list of criteria (keys):
company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")
I am struggling to execute this idea:
SUBSET data IF brand PARTIALLY MATCHES ANY key from brand_key vector OR company PARTIALLY MATCHES ANY key from company_key
So, leave only the lines in which :
(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")
OR
(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")
Note: Needs to be NOT case-sensitive
The desirable output:
company brand vol
1 Coca-Cola Inc. Coca-Cola Zero 2456
2 DF, CocaCola N/A 1653
3 COCA-COLA CocaCola 19
4 Beverages Distribution soft drink 167
Any ideas? Thanks in advance :)
回答1:
Using regex and its |
(or) operator. Parameter ignore.case
deals with the case.
index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
grepl(paste0(brand_key, collapse = "|"), data$company, ignore.case = TRUE)
data[index,]
# company brand vol
#1 Coca-Cola Inc. Coca-Cola Zero 2456
#2 DF, CocaCola N/A 1653
#3 COCA-COLA Coca-Cola 19
#5 Beverages Distribution soft drink 167
回答2:
Considering that coca
can be followed by either a dash or a cola preceded by optional spaces. I paste both columns together for the coca
search and make a different test for Beverage Distribution
data[grepl("coca-|(\\s*cola)", paste(data[,1], data[,2]), ignore.case = T) |
grepl("Beverages Distribution",data[,1]),]
# company brand vol
# 1 Coca-Cola Inc. Coca-Cola Zero 2456
# 2 DF, CocaCola N/A 1653
# 3 COCA-COLA Coca-Cola 19
# 5 Beverages Distribution soft drink 167
If Beverage Distribution
can only be a complete match you may want to change the second part to data[,1] == "Beverages Distribution"
来源:https://stackoverflow.com/questions/51443602/subset-a-df-using-partial-match-with-multiple-criteria