fuzzy-search

pandas extract regex allowing mismatches

人盡茶涼 posted on 2021-02-17 03:30:16
Question: Pandas has a very fast and convenient string method, extract(). It works perfectly with a strict regex such as this one:
strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"
test_df
    R1
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29
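
One way to tolerate mismatches in the two spacers without relying on the regex engine is to compare them by Hamming distance. The sketch below is a minimal illustration, not the asker's or pandas' method: the function names extract_umi and hamming, the parameter max_mismatches, and the choice to allow substitutions only (no insertions or deletions) are all assumptions made here. The third-party regex module expresses a similar idea with fuzzy quantifiers such as {s<=1}.

```python
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def extract_umi(
    read: str,
    pre: str = "ACGAG",
    post: str = "TGGAGTCT",
    min_len: int = 9,
    max_len: int = 13,
    max_mismatches: int = 1,  # hypothetical knob: substitutions allowed per spacer
) -> Optional[dict]:
    """Return the spacers and UMI of a read, or None if nothing fits."""
    head = read[: len(pre)]
    if len(head) < len(pre):
        return None
    pre_cost = hamming(head, pre)
    if pre_cost > max_mismatches:
        return None

    best, best_cost = None, None
    for umi_len in range(min_len, max_len + 1):
        start = len(pre) + umi_len
        tail = read[start : start + len(post)]
        if len(tail) < len(post):
            break  # the read is too short for this and any longer UMI
        post_cost = hamming(tail, post)
        if post_cost > max_mismatches:
            continue
        cost = pre_cost + post_cost
        # Keep the candidate with the fewest total mismatches.
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best = {
                "pre_spacer": head,
                "UMI": read[len(pre) : start],
                "post_spacer": tail,
                "mismatches": cost,
            }
    return best
```

Assuming test_df holds the reads in column R1 as shown above, the function could be applied row by row with test_df["R1"].apply(extract_umi). This will be slower than the vectorized extract(), which is the trade-off for accepting imperfect spacers.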

Generate misspelled words (typos)

别来无恙 posted on 2021-02-07 15:53:35
Question: I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data. Let's say I have a document containing the text: {"text": "The quick brown fox jumps over the lazy dog"} I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog". In other words, I want to add noise to strings to generate misspelled words (typos). What would be a way of automatically generating words with
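
A common way to produce such noise is to apply one random single-character edit (substitution, deletion, insertion, or transposition of adjacent characters) to some of the words in a query. The sketch below is only an illustration under that assumption: the names add_typo and noisy_query, the rate parameter, and the restriction to lowercase ASCII letters are choices made here, not part of the question.

```python
import random
import string


def add_typo(word: str, rng: random.Random) -> str:
    """Apply one random single-character edit to a word."""
    if not word:
        return word
    op = rng.choice(["substitute", "delete", "insert", "transpose"])
    i = rng.randrange(len(word))
    if op == "substitute":
        # Pick a replacement letter that differs from the current one.
        choices = [c for c in string.ascii_lowercase if c != word[i]]
        return word[:i] + rng.choice(choices) + word[i + 1 :]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1 :]
    if op == "transpose" and len(word) > 1:
        i = min(i, len(word) - 2)  # make sure a right-hand neighbour exists
        return word[:i] + word[i + 1] + word[i] + word[i + 2 :]
    # Insert a letter; also the fallback for one-letter words, so that
    # a word never becomes empty.
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


def noisy_query(text: str, rng: random.Random, rate: float = 0.5) -> str:
    """Corrupt each whitespace-separated word with probability `rate`."""
    words = text.split()
    return " ".join(add_typo(w, rng) if rng.random() < rate else w for w in words)


if __name__ == "__main__":
    rng = random.Random(0)  # a fixed seed keeps the test queries reproducible
    for _ in range(3):
        print(noisy_query("lazy dog", rng))
```

Passing a seeded random.Random instance keeps the generated queries reproducible, so a recall measurement can be repeated on the same noisy set. Transposing two identical neighbouring letters leaves a word unchanged, so a small fraction of "typos" may equal the original.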

Generate misspelled words (typos)

三世轮回 posted on 2021-02-07 15:50:20
Question: I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data. Let's say I have a document containing the text: {"text": "The quick brown fox jumps over the lazy dog"} I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog". In other words, I want to add noise to strings to generate misspelled words (typos). What would be a way of automatically generating words with

Python regex's fuzzy search doesn't return all matches when using the or operator

对着背影说爱祢 posted on 2021-02-05 05:50:09
Question: For example, when I use regex.findall(r"(?e)(mazda2 standard){e<=1}", "mazda 2 standard"), the result is ['mazda 2 standard'], as expected. But when I use regex.findall(r"(?e)(mazda2 standard|mazda 2){e<=1}", "mazda 2 standard") or regex.findall(r"(?e)(mazda2 standard|mazda 2){e<=1}", "mazda 2 standard", overlapped=True), the output doesn't contain 'mazda 2 standard' at all. How can I make the output contain 'mazda 2 standard' too?
Answer 1: See the PyPI regex documentation: By default, fuzzy matching

Fuzzy string matching Excel

空扰寡人 posted on 2021-01-28 05:50:39
Question: I currently need a fuzzy string matching algorithm. I found some VBA code at this link: Fuzzy Matching.
Function FuzzyFind(lookup_value As String, tbl_array As Range) As String
    Dim i As Integer, str As String, Value As String
    Dim a As Integer, b As Integer, cell As Variant
    For Each cell In tbl_array
        str = cell
        For i = 1 To Len(lookup_value)
            If InStr(cell, Mid(lookup_value, i, 1)) > 0 Then
                a = a + 1
                cell = Mid(cell, 1, InStr(cell, Mid(lookup_value, i, 1)) - 1) & Mid(cell,

ElasticSearch - cross_fields multi match with fuzzy search

二次信任 posted on 2020-12-27 17:04:25
Question: I have documents that represent users. They have the fields name and surname. Let's say I have two users indexed: Michael Jackson and Michael Starr. I want these sample searches to work:
Michael => { Michael Jackson, Michael Starr }
Jack Mich => { Michael Jackson } (incomplete words and reversed order)
Michal Star => { Michael Starr } (fuzzy search)
I tried different queries and got the best results from a multi_match query with the cross_fields type. There are 2 problems though: It only finds

ElasticSearch - cross_fields multi match with fuzzy search

混江龙づ霸主 posted on 2020-12-27 17:04:12
Question: I have documents that represent users. They have the fields name and surname. Let's say I have two users indexed: Michael Jackson and Michael Starr. I want these sample searches to work:
Michael => { Michael Jackson, Michael Starr }
Jack Mich => { Michael Jackson } (incomplete words and reversed order)
Michal Star => { Michael Starr } (fuzzy search)
I tried different queries and got the best results from a multi_match query with the cross_fields type. There are 2 problems though: It only finds