Subset a df using partial match with multiple criteria

北城余情 submitted on 2019-12-11 08:37:49

Question


This is the dataset:

company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)
data

Which results in:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA         Coca-Cola     19
4           PepsiCo Inc.             Pepsi   2766
5 Beverages Distribution        soft drink    167

Let's say this is the imported volume by brand.

The task is to SUBSET the dataframe to only see observations related to Coca-Cola, not any other brand.

  • The problem is that Coca-Cola is written in many different ways.
  • We also know that the Beverages Distribution company imports only Coca-Cola, even though the table above does not indicate it.

We need to partially match COMPANY and BRAND variables against a list of criteria (keys):

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

I am struggling to execute this idea:

SUBSET data IF brand PARTIALLY MATCHES ANY key from brand_key vector OR company PARTIALLY MATCHES ANY key from company_key

So, keep only the rows in which:

(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")

OR

(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")

Note: The matching must NOT be case-sensitive.

The desired output:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA         Coca-Cola     19
4 Beverages Distribution        soft drink    167

Any ideas? Thanks in advance :)


Answer 1:


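Use a regex with its | (or) operator: collapsing each key vector with | gives a single pattern that matches any of the keys. The ignore.case argument makes the match case-insensitive.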

# company is tested against company_key, brand against brand_key
index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
    grepl(paste0(brand_key, collapse = "|"), data$brand, ignore.case = TRUE)

data[index, ]

#                 company          brand  vol
#1         Coca-Cola Inc. Coca-Cola Zero 2456
#2           DF, CocaCola            N/A 1653
#3              COCA-COLA      Coca-Cola   19
#5 Beverages Distribution     soft drink  167
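
One caveat when keys are pasted into a pattern: each key is read as a regular expression, so a key containing a metacharacter such as . or ( would not be matched literally. Below is a minimal sketch of an alternative, assuming a hypothetical helper matches_any (not part of the original answer) that lower-cases both sides and tests each key as a plain substring with fixed = TRUE:

```r
# Rebuild the example data and keys from the question.
company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

# Hypothetical helper: TRUE where x contains at least one key as a
# literal substring. Lower-casing both sides stands in for ignore.case,
# which grepl() does not apply when fixed = TRUE.
matches_any <- function(x, keys) {
  x <- tolower(as.character(x))
  hits <- lapply(tolower(keys), function(k) grepl(k, x, fixed = TRUE))
  Reduce(`|`, hits)  # element-wise OR across all keys
}

index <- matches_any(data$company, company_key) |
  matches_any(data$brand, brand_key)
result <- data[index, ]
result
```

For the sample data this should select the same rows as the regex version; the difference only shows up once a key contains characters that are special in a regex.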



Answer 2:


The idea is that "coca" can be followed either by a dash or by "cola" preceded by optional spaces. I paste both columns together for the coca search and run a separate test for Beverages Distribution.

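# "coca" followed by a dash, or by optional whitespace and "cola";
# the group keeps the alternation from matching a bare "cola"
data[grepl("coca(-|\\s*cola)", paste(data[,1], data[,2]), ignore.case = TRUE) |
       grepl("Beverages Distribution", data[,1], ignore.case = TRUE),]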
#                  company          brand  vol
# 1         Coca-Cola Inc. Coca-Cola Zero 2456
# 2           DF, CocaCola            N/A 1653
# 3              COCA-COLA      Coca-Cola   19
# 5 Beverages Distribution     soft drink  167

If Beverages Distribution should only count as a complete match, you may want to change the second part to data[,1] == "Beverages Distribution". Note that == is case-sensitive.
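
Since the question asks for case-insensitive matching, an exact comparison can lower-case the column first. The following is a sketch under that assumption, not code from the original answer; the variable names coca and distributor are illustrative, and the pattern groups the alternatives so that "cola" only counts when it follows "coca":

```r
# Rebuild the example data from the question.
company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)

# Partial, case-insensitive match for the Coca-Cola spellings in either column.
coca <- grepl("coca(-|\\s*cola)", paste(data$company, data$brand),
              ignore.case = TRUE)

# Exact, case-insensitive match for the distributor's full name.
distributor <- tolower(as.character(data$company)) == "beverages distribution"

result <- data[coca | distributor, ]
result
```

With the exact comparison, a company such as "Beverages Distribution Ltd." would no longer be selected, which may or may not be what is wanted.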



Source: https://stackoverflow.com/questions/51443602/subset-a-df-using-partial-match-with-multiple-criteria
