string-matching | 易学教程

How to extract all matching patterns (words in a string) in a dataframe column?


Question: I have two dataframes. One (txt.df) has a column of text (text) that I want to extract phrases from. The other (wrd.df) has a column with the phrases (phrase). Both are big dataframes with complex texts and strings, but let's say:

    txt.df <- data.frame(id = c(1, 2, 3, 4, 5), text = c("they love cats and dogs", "he is drinking juice", "the child is having a nap on the bed", "they jump on the bed and break it", "the cat is sleeping on the bed"))
    wrd.df <- data.frame(label = c('a', 'b', 'c',

R: Regex_Join/Fuzzy_Join - Join Inexact Strings in Different Word Orders


Question:

    library(dplyr)
    library(fuzzyjoin)
    df1 <- tibble(a = c("Apple Pear Orange", "Sock Shoe Hat", "Cat Mouse Dog"))
    df2 <- tibble(b = c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog"), c = c("Fruit", "Clothes", "Animals"))
    # Appends 'Animals'
    df3 <- regex_left_join(df1, df2, c("a" = "b"))
    # Appends Nothing
    df3 <- stringdist_left_join(df1, df2, by = c("a" = "b"), max_dist = 3, method = "lcs")

I want to append column c of df2 to df1 using the strings 'Apple', 'Sock' and 'Mouse Dog'. I

Regex in R: finding exact number


Question: This is in R.

    grep("AB22", c("AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2"), value=T)

gives all of them: "AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2", but I want only "AB22", "AB22", "AB22", "AB22+3", "AB22AEM+2", i.e. all the entries containing AB22 and not AB226 or 2265, etc. Thanks.

Answer 1: That's a job for word boundary anchors and/or a negative lookahead assertion:

    grep("\\bAB22(?!\\d)", c("AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2"), value=T, perl=TRUE)
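The same pattern can be tried outside R. The following is a minimal Python sketch, not part of the original answer, that applies the word boundary and the negative lookahead to the question's sample vector; the variable names are illustrative only.

```python
import re

# Same pattern as in the answer: a word boundary before "AB22" and a
# negative lookahead that rejects a digit immediately after it.
PATTERN = re.compile(r"\bAB22(?!\d)")

codes = ["AB22", "AB22", "AB22", "AB22+3", "AB226AEM+1", "AB22AEM+2"]

# Keep only the entries in which the pattern is found anywhere in the string.
matches = [code for code in codes if PATTERN.search(code)]
print(matches)
```

Here "AB226AEM+1" is dropped because the character after "AB22" is a digit, while "AB22AEM+2" is kept because it is followed by a letter.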

String matching in VBA using a predefined function


Question: I have the following data that I want to match. After going through several techniques, the most favorable seems to be the Levenshtein distance method. Would you agree with this approach based on the data below, or would you recommend another method that would match the following better at high volumes? An example of the data can be seen below:

**Column1** **Column2** Modra Digest (DC) Oldstewart2 South West Local /Sunday Times (new) Oldstewart OldStewart political print Saigon
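For readers unfamiliar with the method named in the question, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic programming table. It is written in Python purely for illustration (the question concerns VBA), and `similarity` is a hypothetical helper, not something taken from the original post.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: distance scaled to the range 0..1 after case folding."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Oldstewart", "OldStewart"))
print(similarity("Oldstewart2", "OldStewart"))
```

Each comparison costs time proportional to the product of the two string lengths, so comparing every value in one column against every value in the other may become slow at high volumes.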

Partial String Match in R using the %in% operator?


Question: I'm curious to know whether it is possible to do partial string matches using the %in% operator in R. I know that there are many ways to use stringr, etc. to find partial string matches, but my current code is easier to work with using the %in% operator. For instance, imagine this vector:

    x <- c("Withdrawn", "withdrawn", "5-Withdrawn", "2-WITHDRAWN", "withdrawnn")

I want each of these to be TRUE because the string contains "Withdrawn", but only the first is TRUE:

    x %in% c("Withdrawn")
    [1] TRUE FALSE FALSE

Properties of pmatch function


Question: I don't understand the behavior of the built-in function pmatch (partial string matching). The description provides the following example:

    pmatch("m", c("mean", "median", "mode")) # returns NA instead of 1,2,3

but using:

    pmatch("m", "mean") # returns 1, as I would have expected.

Could anybody explain this behavior to me?

Answer 1: As per the documentation: nomatch: the value to be returned at non-matching or multiply partially matching positions. Note that it is coerced to integer. The nomatch
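The rule quoted from the documentation can be illustrated with a simplified sketch. The Python function below is a hypothetical emulation of pmatch for a single search string, assuming only the behavior described above: an exact match is taken first, a partial match counts only when it is unique, and anything else returns None (standing in for NA). It does not reproduce the vectorized behavior or the duplicates.ok argument.

```python
from typing import Optional, Sequence


def pmatch_one(x: str, table: Sequence[str]) -> Optional[int]:
    """Simplified, single-value emulation of R's pmatch (1-based index or None)."""
    if x == "":
        return None  # empty strings are treated as non-matching
    # Exact matches are considered first; the first one found is used.
    for index, entry in enumerate(table, start=1):
        if entry == x:
            return index
    # Otherwise x must be the beginning of exactly one table entry.
    partial = [i for i, entry in enumerate(table, start=1) if entry.startswith(x)]
    return partial[0] if len(partial) == 1 else None


print(pmatch_one("m", ["mean", "median", "mode"]))   # ambiguous prefix
print(pmatch_one("m", ["mean"]))                     # unique prefix
print(pmatch_one("mo", ["mean", "median", "mode"]))  # unique prefix
```

In the question's first call, "m" is a prefix of all three entries, so the partial match is not unique and the no-match value is returned; in the second call there is only one candidate.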

Algorithmic way to search a list of tuples for a matching substring?


Question: I have a list of tuples, about 100k entries. Each tuple consists of an id and a string. My goal is to list the ids of the tuples whose strings contain a substring from a given list of substrings. My current solution is a set comprehension; ids can repeat.

    tuples = [(id1, 'cheese trees'), (id2, 'freezy breeze'),...]
    vals = ['cheese', 'flees']
    ids = {i[0] for i in tuples if any(val in i[1] for val in vals)}

output: {id1}

Is there an algorithm that would allow doing that quicker? I'm
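One option that is sometimes suggested for this kind of search, offered here as a sketch rather than as the answer to the question, is to compile all substrings into a single regular expression so that each string is scanned once instead of once per substring. The function name `matching_ids` and the integer ids in the sample data are assumptions made for illustration; whether this is actually faster depends on the data and should be measured.

```python
import re
from typing import Hashable, Iterable, Sequence, Set, Tuple


def matching_ids(
    tuples: Iterable[Tuple[Hashable, str]], vals: Sequence[str]
) -> Set[Hashable]:
    """Return the ids whose string contains at least one of the substrings."""
    if not vals:
        return set()  # an empty alternation would otherwise match every string
    # re.escape keeps characters such as "+" or "." literal in the pattern.
    pattern = re.compile("|".join(re.escape(val) for val in vals))
    return {item_id for item_id, text in tuples if pattern.search(text)}


# Illustrative data; ids may repeat, as in the question.
sample = [(1, "cheese trees"), (2, "freezy breeze"), (3, "he flees"), (1, "more cheese")]
ids = matching_ids(sample, ["cheese", "flees"])
print(ids)
```

For very large substring lists, a dedicated multi-pattern algorithm such as Aho-Corasick is another direction to consider.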