string-matching

How to extract all matching patterns (words in a string) in a dataframe column?

生来就可爱ヽ(ⅴ<●) submitted on 2021-02-11 06:10:45
Question: I have two data frames. One (txt.df) has a column with text I want to extract phrases from (text). The other (wrd.df) has a column with the phrases (phrase). Both are big data frames with complex texts and strings, but let's say:
txt.df <- data.frame(id = c(1, 2, 3, 4, 5), text = c("they love cats and dogs", "he is drinking juice", "the child is having a nap on the bed", "they jump on the bed and break it", "the cat is sleeping on the bed"))
wrd.df <- data.frame(label = c('a', 'b', 'c',

R: Regex_Join/Fuzzy_Join - Join Inexact Strings in Different Word Orders

非 Y 不嫁゛ submitted on 2021-02-08 04:03:45
Question: df1 df2 df3
library(dplyr)
library(fuzzyjoin)
df1 <- tibble(a = c("Apple Pear Orange", "Sock Shoe Hat", "Cat Mouse Dog"))
df2 <- tibble(b = c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog"), c = c("Fruit", "Clothes", "Animals"))
# Appends 'Animals'
df3 <- regex_left_join(df1, df2, c("a" = "b"))
# Appends Nothing
df3 <- stringdist_left_join(df1, df2, by = c("a" = "b"), max_dist = 3, method = "lcs")
I want to append column c of df2 to df1 using the strings 'Apple', 'Sock' and 'Mouse Dog'. I

Regex in R: finding exact number

耗尽温柔 submitted on 2021-02-05 06:03:22
Question: This is in R.
grep("AB22", c("AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2"), value=T)
gives all of them: "AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2", but I want only "AB22", "AB22", "AB22", "AB22+3", "AB22AEM+2", i.e. all the entries containing AB22 and not AB226 or 2265, etc. Thanks
Answer 1: That's a job for word boundary anchors and/or a negative lookahead assertion:
grep("\\bAB22(?!\\d)", c("AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2"), value=T, perl=TRUE)
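The lookahead from the answer can also be checked outside R. The following is a minimal sketch in Python, assuming that Python's re module treats \b and (?!\d) the same way as R's perl=TRUE engine for this ASCII input. The variable names are illustrative and do not come from the original post.

```python
import re

codes = ["AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2"]

# \b anchors the match at a word boundary;
# (?!\d) rejects a match that is immediately followed by a digit.
pattern = re.compile(r"\bAB22(?!\d)")

# Keep only the entries where the pattern is found somewhere in the string.
matches = [code for code in codes if pattern.search(code)]
print(matches)
```

Only "AB226AEM+1" is dropped, because the character after AB22 there is a digit.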

String matching in VBA using a predefined function

一曲冷凌霜 submitted on 2021-01-29 08:41:20
Question: I have the following data that I want to match. After going through several techniques, the most favorable seems to be the Levenshtein distance method. Would you agree with this approach based on the data below, or would you recommend some other method that would match it better at high volumes? An example of the data can be seen below:
**Column1** **Column2** Modra Digest (DC) Oldstewart2 South West Local /Sunday Times (new) Oldstewart OldStewart political print Saigon

Partial String Match in R using the %in% operator?

和自甴很熟 submitted on 2021-01-28 12:20:15
Question: I'm curious to know if it is possible to do partial string matches using the %in% operator in R. I know that there are many ways to use stringr, etc. to find partial string matches, but my current code works more easily with the %in% operator. For instance, imagine this vector:
x <- c("Withdrawn", "withdrawn", "5-Withdrawn", "2-WITHDRAWN", "withdrawnn")
I want each of these to be TRUE because the string contains "Withdrawn", but only the first is TRUE:
x %in% c("Withdrawn")
[1] TRUE FALSE FALSE
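The question runs into the difference between whole-string membership and substring matching. The following is a minimal sketch in Python rather than R, where a set membership test stands in for %in%. That analogy is an assumption made for illustration only, and the names exact and partial are hypothetical.

```python
x = ["Withdrawn", "withdrawn", "5-Withdrawn", "2-WITHDRAWN", "withdrawnn"]

# Whole-string membership, analogous to x %in% c("Withdrawn"):
# only an element equal to "Withdrawn" passes.
exact = [s in {"Withdrawn"} for s in x]

# Case-insensitive substring test: lower-case each element first,
# then check whether it contains "withdrawn" anywhere.
partial = ["withdrawn" in s.lower() for s in x]

print(exact)
print(partial)
```

A membership test compares complete strings, so a partial match needs a separate containment or pattern test on each element.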

Properties of pmatch function

大城市里の小女人 submitted on 2021-01-27 14:59:15
Question: I don't understand the behavior of the built-in function pmatch (partial string matching). The description provides the following example:
pmatch("m", c("mean", "median", "mode")) # returns NA instead of 1,2,3
but using:
pmatch("m", "mean") # returns 1, as I would have expected.
Could anybody explain this behavior to me?
Answer 1: As per the documentation: nomatch: the value to be returned at non-matching or multiply partially matching positions. Note that it is coerced to integer. The nomatch

Algorithmic way to search a list of tuples for a matching substring?

断了今生、忘了曾经 submitted on 2021-01-27 12:43:31
Question: I have a list of tuples, about 100k entries. Each tuple consists of an id and a string. My goal is to list the ids of the tuples whose strings contain a substring from a given list of substrings. My current solution uses a set comprehension, since ids can repeat.
tuples = [(id1, 'cheese trees'), (id2, 'freezy breeze'),...]
vals = ['cheese', 'flees']
ids = {i[0] for i in tuples if any(val in i[1] for val in vals)}
output: {id1}
Is there an algorithm that would allow doing that quicker? I'm
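One variation that is often tried is to fold all substrings into a single compiled regular expression, so that each string is scanned once. The following is a minimal sketch, assuming integer ids in place of the undefined id1 and id2 from the question. Whether it is actually faster than the any(...) comprehension depends on the data and should be measured.

```python
import re

# Integer ids stand in for id1 and id2 from the question (assumption).
tuples = [(1, "cheese trees"), (2, "freezy breeze"), (1, "more cheese")]
vals = ["cheese", "flees"]

# re.escape keeps any regex metacharacters in the substrings literal;
# the alternation matches if any one of the substrings occurs.
pattern = re.compile("|".join(re.escape(val) for val in vals))

# A set comprehension deduplicates repeated ids, as in the original solution.
ids = {item_id for item_id, text in tuples if pattern.search(text)}
print(ids)
```

For very large substring lists, a dedicated multi-pattern algorithm such as Aho-Corasick is the usual next step, though it is not part of the Python standard library.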