string-matching | 易学教程

Finding how similar two strings are

Question: I'm looking for an algorithm that takes two strings and gives me back a "factor of similarity". Basically, I will have an input that may be misspelled, have letters transposed, etc., and I have to find the closest match(es) in a list of possible values. This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 characters, so it can be relatively slow. I know this exists, I've seen it before, but I can't remember its
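One way to get such a similarity factor with only the Python standard library is `difflib`. The sketch below is an illustration rather than the answer the asker was trying to recall; the candidate list and the helper names `similarity` and `closest_matches` are made up for the example.

```python
import difflib

# Hypothetical in-memory list standing in for the ~500 valid values.
CANDIDATES = ["apple", "banana", "orange", "grape", "pineapple"]

def similarity(a, b):
    """Return a similarity factor between 0.0 (nothing shared) and 1.0 (identical)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def closest_matches(query, candidates, limit=3, cutoff=0.6):
    """Return up to `limit` candidates scoring at least `cutoff`, best first."""
    return difflib.get_close_matches(query, candidates, n=limit, cutoff=cutoff)

if __name__ == "__main__":
    # A transposed-letter typo still ranks the intended word first.
    print(closest_matches("appel", CANDIDATES))
    print(similarity("appel", "apple"))
```

For a list of a few hundred short strings, comparing the query against every candidate like this should be fast enough; the `cutoff` value is a tuning knob, not a recommended constant.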

How to search a specific value in all tables (PostgreSQL)?

Is it possible to search every column of every table for a particular value in PostgreSQL? A similar question is available here for Oracle.

How about dumping the contents of the database, then using grep?

    $ pg_dump --data-only --inserts -U postgres your-db-name > a.tmp
    $ grep United a.tmp
    INSERT INTO countries VALUES ('US', 'United States');
    INSERT INTO countries VALUES ('GB', 'United Kingdom');

The same utility, pg_dump, can include column names in the output. Just change --inserts to --column-inserts. That way you can search for specific column names, too. But if I were looking for column

How can I match fuzzy match strings from two datasets?

Question: I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS. So far AGREP is the closest tool I've found that might work. I can use Levenshtein distances in the AGREP package, which measure the

Regular Expression Arabic characters and numbers only

Question: I want a regular expression that accepts only Arabic characters, spaces and numbers. Numbers are not required to be in Arabic. I found the following expression: ^[\u0621-\u064A]+$ which accepts only Arabic characters, while I need Arabic characters, spaces and numbers. Answer 1: Just add 0-9 to your character class: ^[\u0621-\u064A0-9 ]+$ OR add \u0660-\u0669, the range of Arabic digits, to the character class: ^[\u0621-\u064A\u0660-\u0669 ]+$ Answer 2: You can use: ^[
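The two character classes from Answer 1 can be merged into one pattern that accepts both digit styles. The following is a minimal Python sketch under that assumption; the name `is_arabic_text` is hypothetical, and `fullmatch` is used in place of the `^...$` anchors so that a trailing newline is not silently accepted.

```python
import re

# Arabic letters U+0621-U+064A, Arabic-Indic digits U+0660-U+0669,
# ASCII digits 0-9, and the space character.
ARABIC_TEXT = re.compile(r"[\u0621-\u064A\u0660-\u06690-9 ]+")

def is_arabic_text(value):
    """Return True if `value` is non-empty and uses only the allowed characters."""
    return ARABIC_TEXT.fullmatch(value) is not None

if __name__ == "__main__":
    greeting = "\u0645\u0631\u062d\u0628\u0627"  # an Arabic word, written with escapes
    print(is_arabic_text(greeting + " 123"))     # letters, a space, ASCII digits
    print(is_arabic_text("hello 123"))           # Latin letters are rejected
```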

Regex for existence of some words whose order doesn't matter

Question: I would like to write a regex that checks for the existence of some words, regardless of their order of appearance. For example, search for "Tim" and "stupid". My regex is Tim.*stupid|stupid.*Tim. But is it possible to write a simpler regex (e.g. so that each word appears just once in the regex itself)? Answer 1: See this regex: /^(?=.*Tim)(?=.*stupid).+/ Regex explanation: ^ asserts position at start of string. (?=.*Tim) asserts that "Tim" is present in the string. (?=.*stupid)
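The lookahead approach from Answer 1 extends to any number of words by emitting one `(?=.*word)` group per word. A small Python sketch, where `contains_all` is a hypothetical helper name:

```python
import re

def contains_all(text, words):
    """Return True if every word occurs somewhere in `text`, in any order."""
    # One lookahead per word; each is checked from the start of the string,
    # so the order in which the words appear does not matter.
    pattern = "".join(f"(?=.*{re.escape(word)})" for word in words)
    # re.match anchors at the start, playing the role of ^ in the original regex.
    return re.match(pattern, text, re.DOTALL) is not None

if __name__ == "__main__":
    print(contains_all("Tim is stupid", ["Tim", "stupid"]))
    print(contains_all("stupid, says Tim", ["Tim", "stupid"]))
    print(contains_all("Tim is smart", ["Tim", "stupid"]))
```

As in the original regex, this matches substrings, so "Timothy" would satisfy "Tim"; wrapping each escaped word in `\b...\b` is one way to require whole words.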

Javascript fuzzy search that makes sense

Question: I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages). After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match. One obvious flaw, which is
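The insertion, deletion and substitution count described above can be computed with a standard dynamic-programming table. Below is a minimal Python sketch of that textbook algorithm, keeping only two rows in memory; it is not the code used by fuzzyset.js or fuse.js.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions, deletions
    and substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```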

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

Question: I am doing clinical message normalization (spell check) in which I check each given word against a 900,000-word medical dictionary. I am more concerned about time complexity/performance. I want to do fuzzy string comparison, but I'm not sure which library to use. Option 1: import Levenshtein; Levenshtein.ratio('hello world', 'hello') Result: 0.625 Option 2: import difflib; difflib.SequenceMatcher(None, 'hello world', 'hello').ratio() Result: 0.625 In this example both give the same
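For the `difflib` option, the ratio is 2·M/T, where M is the number of matched characters and T is the combined length of both strings; here that is 2·5/16 = 0.625. The sketch below reproduces that value and shows one possible way to cut work on a large dictionary, using `SequenceMatcher`'s cheaper upper-bound methods to skip hopeless candidates. The `best_match` helper, the sample words and the 0.8 cutoff are assumptions for illustration, and no benchmark against the Levenshtein package is implied.

```python
import difflib

def ratio(a, b):
    """Similarity ratio as computed by difflib: 2 * matches / total length."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def best_match(word, dictionary, cutoff=0.8):
    """Return the dictionary entry most similar to `word`, or None if no
    entry reaches `cutoff`."""
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # seq2 is analysed once and reused for every candidate
    best, best_score = None, cutoff
    for candidate in dictionary:
        matcher.set_seq1(candidate)
        # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
        # so a candidate that fails either one cannot beat the current best.
        if matcher.real_quick_ratio() < best_score:
            continue
        if matcher.quick_ratio() < best_score:
            continue
        score = matcher.ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best

if __name__ == "__main__":
    print(ratio("hello world", "hello"))  # 0.625
    print(best_match("diabetis", ["diabetes", "dialysis", "diagnosis"]))
```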

Return positions of a regex match() in Javascript?

Question: Is there a way to retrieve the (starting) character positions inside a string of the results of a regex match() in JavaScript? Answer 1: exec returns an object with an index property: var match = /bar/.exec("foobar"); if (match) { console.log("match found at " + match.index); } And for multiple matches: var re = /bar/g, str = "foobarfoobar", match; while ((match = re.exec(str)) != null) { console.log("match found at " + match.index); } Answer 2: Here's what I came up with: // Finds starting and ending

Filter multiple values on a string column in dplyr

Question: I have a data.frame with character data in one of the columns. I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing? Example: data.frame name = dat

    days  name
    88    Lynn
    11    Tom
    2     Chris
    5     Lisa
    22    Kyla
    1     Tom
    222   Lynn
    2     Lynn

I'd like to filter out Tom and Lynn, for example. When I do: target <- c("Tom", "Lynn") filt <- filter(dat, name == target) I get this error: longer object length is not a multiple of shorter object

A better similarity ranking algorithm for variable length strings

Question: I'm looking for a string similarity algorithm that yields better results on variable-length strings than the ones that are usually suggested (Levenshtein distance, Soundex, etc.). For example, given string A: "Robert", string B: "Amy Robertson" would be a better match than string C: "Richard". Also, preferably, this algorithm should be language-agnostic (it should also work in languages other than English). Answer 1: Simon White of Catalysoft wrote an article about a very clever algorithm that
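The excerpt is cut off before the algorithm is described, so the following is not a reconstruction of it. It is a sketch of one common measure for this kind of problem, a Sørensen-Dice coefficient over adjacent character pairs, shown only because it ranks the question's example the way the asker wants. The function names are hypothetical.

```python
from collections import Counter

def letter_pairs(text):
    """Count adjacent character pairs within each whitespace-separated word."""
    pairs = Counter()
    for word in text.upper().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs

def pair_similarity(a, b):
    """Dice coefficient over character pairs: 2 * shared pairs / total pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0  # neither string contains a two-character word
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2.0 * shared / total

if __name__ == "__main__":
    print(pair_similarity("Robert", "Amy Robertson"))
    print(pair_similarity("Robert", "Richard"))
```

"Robert" shares all five of its pairs with "Amy Robertson" and none with "Richard", so the longer string scores higher even though it needs more edits to reach. The measure uses no language-specific rules beyond case folding.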