string-matching | 易学教程

Finding how similar two strings are

I'm looking for an algorithm that takes two strings and gives me back a "factor of similarity". Basically, I will have an input that may be misspelled, have letters transposed, etc., and I have to find the closest match(es) in a list of possible values. This is not for searching in a database: I'll have an in-memory list of 500 or so strings to match against, all under 30 characters, so it can be relatively slow. I know this exists, I've seen it before, but I can't remember its name. Edit: Thanks for pointing out Levenshtein and Hamming. Now, which one should I implement? They
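
As a rough illustration of the Levenshtein option mentioned in the edit, here is a minimal Python sketch (the question names no language, so Python is an assumption). Hamming distance is only defined for strings of equal length, so it would not cover inserted or dropped letters. The function names `levenshtein` and `closest_matches` and the sample words are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_matches(query: str, candidates: list[str]) -> list[str]:
    """Return every candidate tied for the smallest distance to the query."""
    if not candidates:
        return []
    distances = [levenshtein(query.lower(), c.lower()) for c in candidates]
    best = min(distances)
    return [c for c, d in zip(candidates, distances) if d == best]


# Hypothetical sample data: "apple" and "maple" are both one edit away.
print(closest_matches("aple", ["apple", "maple", "grape"]))
```

For a list of about 500 strings under 30 characters, this brute-force scan does a few hundred thousand cell updates per query, which should be acceptable given that the question allows it to be relatively slow.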

Find matches of a vector of strings in another vector of strings

Question: I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases. # Sample data frame of articles articles <- data.frame(id=c(1, 2, 3, 4), text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod", "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,", "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo", "consequat. Duis aute irure dolor in reprehenderit in
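
The question is asked in R and its keyword list is cut off, so the following is only a Python sketch of the general idea: build one alternation pattern from the keywords and keep the articles it matches. The keyword list and the helper name `filter_articles` are hypothetical, and only the three complete sample texts from the excerpt are used.

```python
import re

# The three complete sample texts from the question, as plain dicts.
articles = [
    {"id": 1, "text": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod"},
    {"id": 2, "text": "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,"},
    {"id": 3, "text": "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo"},
]

# Hypothetical keywords and phrases; the original list is not shown.
keywords = ["dolor", "ut aliquip"]


def filter_articles(articles, keywords):
    """Keep articles whose text mentions at least one keyword or phrase."""
    if not keywords:
        return []
    # re.escape protects punctuation in keywords; \b keeps "dolor" from
    # matching inside the longer word "dolore".
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [a for a in articles if pattern.search(a["text"])]


print([a["id"] for a in filter_articles(articles, keywords)])  # [1, 3]
```

Dropping the `\b` anchors would turn this into plain substring matching, which would also pick up article 2 through "dolore".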

String similarity in PHP: levenshtein like function for long strings

Question: The levenshtein function in PHP works only on strings of at most 255 characters. What are good alternatives for computing a similarity score between sentences in PHP? Basically, I have a database of sentences and I want to find approximate duplicates. The similar_text function is not giving me the results I expect. What is the easiest way to detect similar sentences like the ones below: $ss="Jack is a very nice boy, isn't he?"; $pp="jack is a very nice boy is he"; $ss=strtolower($ss); // convert to lower case as we
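
One possible direction, sketched here in Python rather than PHP and not taken from the original thread, is to compare sentences at the word level instead of character by character, which sidesteps any string-length limit. The sketch below assumes a Jaccard similarity over lowercased word sets; `word_set` and `jaccard_similarity` are hypothetical helper names.

```python
import re


def word_set(sentence: str) -> set[str]:
    """Lowercase, drop apostrophes, and split into a set of ASCII words."""
    lowered = sentence.lower().replace("'", "")
    return set(re.findall(r"[a-z0-9]+", lowered))


def jaccard_similarity(a: str, b: str) -> float:
    """Shared words divided by all distinct words across both sentences."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


ss = "Jack is a very nice boy, isn't he?"
pp = "jack is a very nice boy is he"
print(jaccard_similarity(ss, pp))  # 0.875 (7 shared words out of 8 distinct)
```

This ignores word order and repetition, so it suits near-duplicate detection better than fine-grained ranking; a threshold on the score would have to be tuned against real data.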

agrep: only return best match(es)

I'm using the 'agrep' function in R, which returns a vector of matches. I would like a function similar to agrep that only returns the best match, or best matches if there are ties. Currently, I am doing this using the 'sdist()' function from the package 'cba' on each element of the resulting vector, but this seems very redundant. /edit: here is the function I'm currently using. I'd like to speed it up, as it seems redundant to calculate distance twice. library(cba) word <- 'test' words <- c('Teest','teeeest','New York City','yeast','text','Test') ClosestMatch <- function(string,StringVector)

Regular Expression Arabic characters and numbers only

I want a regular expression that accepts only Arabic characters, spaces, and numbers. The numbers are not required to be in Arabic. I found the following expression: ^[\u0621-\u064A]+$ which accepts only Arabic characters, while I need Arabic characters, spaces, and numbers.

Just add 0-9 and a space to your character class: ^[\u0621-\u064A0-9 ]+$ OR add \u0660-\u0669, the range of Arabic-Indic digits, to the character class: ^[\u0621-\u064A\u0660-\u0669 ]+$

You can use: ^[\u0621-\u064A\s\p{N}]+$ \p{N} will match any Unicode numeric character. To match only ASCII digits use: ^[\u0621-\u064A\s0
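
To try the character-class approach outside a regex tester, here is a small Python sketch. It combines the two explicit ranges from the answers (ASCII digits and Arabic-Indic digits) because Python's built-in `re` module does not support `\p{N}`. The name `is_arabic_text` is hypothetical.

```python
import re

# Arabic letters U+0621-U+064A, Arabic-Indic digits U+0660-U+0669,
# ASCII digits 0-9, and the space character.
ARABIC_TEXT = re.compile(r"[\u0621-\u064A\u0660-\u06690-9 ]+")


def is_arabic_text(value: str) -> bool:
    """True if the whole string uses only the allowed characters."""
    # fullmatch plays the role of the ^...$ anchors in the original pattern.
    return ARABIC_TEXT.fullmatch(value) is not None


print(is_arabic_text("\u0645\u0631\u062d\u0628\u0627 123"))  # True
print(is_arabic_text("hello 123"))                            # False
```

Using `fullmatch` instead of `^...$` also avoids `$` accepting a trailing newline, which may or may not matter for the input being validated.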

Find matching strings between two vectors in R

Question: I have two vectors in R and I want to find partial matches between them. My data: the first vector comes from a dataset named muc, which contains 6400 street names. muc$name looks like: muc$name = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße",...) The other vector is d_vector. It contains around 1400 names. d_vector = "Abel", "Abendroth", "von Abercron", "Abetz", "Abicht", "Abromeit", ... I want to find all the street names that contain a name from d_vector somewhere in the

R fuzzy string match to return specific column based on matched string

Question: I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, an address can be written in different ways, with different cases, spellings, etc. Apart from this, an address can be duplicated if it is written only to the building level, so different flats have the same address. I did some research and figured out the package stringdist

Javascript fuzzy search that makes sense

I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages). After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match. One obvious flaw, which is fixed in the Damerau-Levenshtein model, is that both blub and boob are considered equally similar to bulb
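
The blub/boob/bulb point can be checked with a short sketch. The code below is Python rather than JavaScript and implements the restricted Damerau-Levenshtein variant (often called optimal string alignment), which adds adjacent transpositions to the three basic edits. The function name `edit_distance` and its `transpositions` flag are hypothetical.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Levenshtein distance, optionally counting an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "lu" <-> "ul".
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for word in ("blub", "boob"):
    plain = edit_distance(word, "bulb", transpositions=False)
    damerau = edit_distance(word, "bulb", transpositions=True)
    print(word, plain, damerau)
```

With transpositions disabled both words are two edits away from bulb; with them enabled, blub drops to one edit while boob stays at two.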

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

I am doing clinical message normalization (spell check), in which I check each given word against a 900,000-word medical dictionary. I am mainly concerned about time complexity and performance. I want to do fuzzy string comparison, but I'm not sure which library to use. Option 1: import Levenshtein Levenshtein.ratio('hello world', 'hello') Result: 0.625 Option 2: import difflib difflib.SequenceMatcher(None, 'hello world', 'hello').ratio() Result: 0.625 In this example both give the same answer. Do you think both perform alike in this case? In case you're interested in a quick visual comparison of
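
For reference, Option 2 needs only the standard library. The sketch below reproduces the 0.625 ratio and wraps `difflib.get_close_matches` in a hypothetical `suggest` helper with a tiny made-up word list; it says nothing about how either library performs on a 900,000-word dictionary.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio: 2 * matched characters / total characters."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def suggest(word: str, dictionary: list[str], n: int = 3, cutoff: float = 0.6) -> list[str]:
    """Return up to n dictionary words scoring at least cutoff, best first."""
    return difflib.get_close_matches(word, dictionary, n=n, cutoff=cutoff)


# 5 matching characters out of 11 + 5 in total: 2 * 5 / 16 = 0.625
print(similarity("hello world", "hello"))

# Hypothetical stand-in for the medical dictionary.
print(suggest("appel", ["ape", "apple", "peach", "puppy"]))  # ['apple', 'ape']
```

`get_close_matches` scans the whole list for every query, so timing it with `timeit` on a realistic sample would be a sensible first step before committing to either option.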

Regex for existence of some words whose order doesn't matter

I would like to write a regex for searching for the existence of some words, where their order of appearance doesn't matter. For example, search for "Tim" and "stupid". My regex is Tim.*stupid|stupid.*Tim. But is it possible to write a simpler regex (e.g. so that the two words appear just once in the regex itself)?

See this regex: /^(?=.*Tim)(?=.*stupid).+/ Regex explanation: ^ Asserts position at start of string. (?=.*Tim) Asserts that "Tim" is present in the string. (?=.*stupid) Asserts that "stupid" is present in the string. .+ Now that our phrases are present, this string is valid
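
The lookahead pattern extends to any number of words. Here is a minimal Python sketch that builds it from a list; the question does not name a regex flavor, so Python's `re` is an assumption and `contains_all` is a hypothetical helper name.

```python
import re


def contains_all(text: str, words: list[str]) -> bool:
    """True if every word occurs somewhere in text, in any order."""
    # One lookahead per word, each scanning from the start of the string.
    lookaheads = "".join(f"(?=.*{re.escape(w)})" for w in words)
    pattern = "^" + lookaheads + ".+"
    # Like the original pattern, this is case-sensitive and matches
    # substrings, so "Tim" is also found inside "Timothy".
    return re.match(pattern, text) is not None


print(contains_all("Tim is stupid", ["Tim", "stupid"]))     # True
print(contains_all("How stupid, Tim!", ["Tim", "stupid"]))  # True
print(contains_all("Tim is clever", ["Tim", "stupid"]))     # False
```

Each word now appears once in the pattern, at the cost of one scan of the string per lookahead.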